RegEx issue using JAVA


After a week of searching the web and trying different approaches, I give up.
I am facing an issue with regex in Java and I am wondering if I can find some help here.
I am trying to find this "< < < < 06 76 > > " pattern in a huge string that I have to search through.
What I know is that between the last "<" and the first ">" there can only be digits, and there can be any number of spaces between the last ">" and the first "<". Also, between each "<" or ">" there can be from 1 to 5 spaces.
I was able to create part of a pattern to use for my search, but I can't move forward from there.
Here is what I was able to create as a search pattern.
String tag_open = "<\\s{0,4}<\\s{0,4}<\\s{0,4}<\\s{0,4}";
I am stuck trying to include the idea of "any numbers, not more than 4 digits, separated by 1 to 5 spaces".
Finally, I am able to "close" the pattern to be searched with
"\\s{0,4}>\\s{0,4}>\\s{0,4}"
Sorry for the long text. I am trying to be as detailed as possible.
Thanks so much!
Regards.
I forgot to mention something: there are 2 types of "tags" that I have to look for. One is " < < < < 06 76 > > " and the second one is " < < 39 85 > > > > ". The number of spaces between each "<" and ">" can be from 1 to 4, and the same applies between the last "<" and the first digit and between the last digit and the first ">". Finally, there are from 1 to 6 spaces between the numbers.
Ok... I hope this is my last edit. :-)
I have to find the positions of those TWO types of tags, which mark the beginning and the end of each paragraph. The beginning of the paragraph is established by the pattern:
Start of paragraph: Four "<<<<" + some spaces + 2 random digits + some spaces + 2 random digits + some spaces + Two ">>".
Between consecutive "<" or ">" characters there can be 1 to 4 spaces.
End of paragraph: Two "<<" + some spaces + 2 random digits + some spaces + 2 random digits + some spaces + Four ">>>>".
Between consecutive "<" or ">" characters there can be 1 to 4 spaces.
Here is an example of a text paragraph:
< < < < 06 76 > >
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec sit amet mauris lorem. Etiam aliquam iaculis tellus, ac accumsan velit. Vivamus venenatis diam sit amet elementum sollicitudin. Curabitur nec finibus tellus. Proin vestibulum placerat diam. Sed eget risus volutpat, placerat arcu non, commodo ex. Vivamus et ipsum efficitur, ornare nisi sit amet, venenatis diam. Sed aliquet lacinia nulla eu mattis. Integer dapibus, odio a rhoncus porttitor, tellus ligula imperdiet sem, at semper magna arcu a mauris. Vestibulum accumsan ornare aliquet. Curabitur a mollis ex, a ullamcorper enim. Donec urna nibh, vestibulum ut gravida vel, posuere id elit. Proin ut fringilla turpis.
< < 06 76 > > > >
< < < < 12 23 > >
Morbi aliquet condimentum tempus. Fusce quis rutrum lacus. Curabitur blandit vestibulum lacinia. Ut ac maximus dolor. Suspendisse potenti. Sed quis turpis felis. Sed magna mauris, mattis non mi id, mollis posuere massa. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Suspendisse dictum sapien bibendum dictum ultricies. Suspendisse sed lectus egestas, congue ligula quis, fringilla sapien. Nullam et odio elit. Nullam pellentesque nunc tellus, vitae pharetra lorem congue id.
< < 12 23 > > > >
Again, sorry for the long post and the many last minute edits.

Something like this?
String input = "< < < < 06 76 > > ";
//For all tags
Pattern pat = Pattern.compile("(< +)+([0-9]+ +)+(> +)+");
//For tag < < < < 06 76 > >
//Pattern pat = Pattern.compile("(< +){4}([0-9]+ +)+(> +)+");
//For tag < < 39 85 > > > >
//Pattern pat = Pattern.compile("(< +){2}([0-9]+ +)+(> +)+");
Matcher mat = pat.matcher(input);
while (mat.find()) {
    System.out.println(mat.group());
}
//Prints:
//< < < < 06 76 > >

You might use
<(?:\s{1,5}<)*\s*(?:\d+\s*)+>(?:\s{1,5}>)*
< Match literally
(?:\s{1,5}<)* Repeat 0+ times matches 1-5 whitespace chars followed by <
\s* Match optional whitespace chars
(?:\d+\s*)+ Match 1+ times matching 1+ digits and optional whitespace chars
> Match literally
(?:\s{1,5}>)* Repeat 0+ times matching 1-5 whitespace chars followed by >
Note that \s could also match a newline. In Java you might also use \h{1,5} to match horizontal whitespace chars.
Example in Java:
String regex = "<(?:\\s{1,5}<)*\\s*(?:\\d+\\s*)+>(?:\\s{1,5}>)*";
String string = "< < < < 06 76 > >\n"
        + "< < < < > >\n"
        + "< < < < 06 76 >\n"
        + "< < < < 06 76 \n";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}
Output
< < < < 06 76 > >
< < < < 06 76 >
EDIT
The pattern for the start of the paragraph
<(?:\s{1,4}<){3}(?:\s*\d\d){2}\s*>\s{1,4}>
The pattern for the end of the paragraph
<\s{1,4}<(?:\s*\d\d){2}\s*>(?:\s{1,4}>){3}
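Since the question asks for the positions of both tag types, the two patterns above could be applied roughly as follows. This is only a sketch: the class name TagLocator and the method offsets are made up for illustration, and the sample string is a shortened version of the text in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class TagLocator {
    // Four '<', two 2-digit numbers, two '>' (start of a paragraph).
    static final Pattern START = Pattern.compile(
            "<(?:\\s{1,4}<){3}(?:\\s*\\d\\d){2}\\s*>\\s{1,4}>");
    // Two '<', two 2-digit numbers, four '>' (end of a paragraph).
    static final Pattern END = Pattern.compile(
            "<\\s{1,4}<(?:\\s*\\d\\d){2}\\s*>(?:\\s{1,4}>){3}");

    // Returns the start offset of every match of the given pattern.
    static List<Integer> offsets(Pattern pattern, String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "< < < < 06 76 > >\ntext\n< < 06 76 > > > >";
        System.out.println("start tags at " + offsets(START, text));
        System.out.println("end tags at " + offsets(END, text));
    }
}
```

Use m.end() instead of m.start() if the offset just past the tag is needed, for example to cut out the text between a start tag and an end tag.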
If you want for example to get the content of the paragraph, you could use a capture group.
<(?:\s{1,4}<){3}(?:\s*\d\d){2}\s*>\s{1,4}>([\s\S]*?)<\s{1,4}<(?:\s*\d\d){2}\s*>(?:\s{1,4}>){3}
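A minimal sketch of how the capture group could be read in Java. The class name ParagraphExtractor and the method bodies are hypothetical, and the sample text in main is shortened from the question. Note that the pattern does not check that the numbers in the start tag and the end tag are the same.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class ParagraphExtractor {
    // Start tag, lazily captured paragraph body (group 1), end tag.
    static final Pattern PARAGRAPH = Pattern.compile(
            "<(?:\\s{1,4}<){3}(?:\\s*\\d\\d){2}\\s*>\\s{1,4}>"
            + "([\\s\\S]*?)"
            + "<\\s{1,4}<(?:\\s*\\d\\d){2}\\s*>(?:\\s{1,4}>){3}");

    // Returns the trimmed body of every paragraph found in the text.
    static List<String> bodies(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(text);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "< < < < 06 76 > >\nLorem ipsum.\n< < 06 76 > > > >\n"
                + "< < < < 12 23 > >\nMorbi aliquet.\n< < 12 23 > > > >";
        for (String body : bodies(text)) {
            System.out.println(body);
        }
    }
}
```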

Related

Find and replace text iteratively from a list

Suppose I have this text:
Donec sollicitudin ? malesuada. "Curabitur" arcu erat, accumsan
id imperdiet et, porttitor at sem. Quisque velit nisi, ? ut
lacinia in, ? id enim. Proin eget tortor risus.
and I have these strings in a list:
["apple", "banana", "cherry"]
How can I replace each occurrence of ? with the next item from the list, in order? Expected output:
Donec sollicitudin apple malesuada. "Curabitur" arcu erat, accumsan
id imperdiet et, porttitor at sem. Quisque velit nisi, banana ut
lacinia in, cherry id enim. Proin eget tortor risus.
Is it possible to use Notepad++ to achieve something like this for a longer text and list? Or are there other technologies that I can use?

This Python script will get the job done. If there are more ? than replacements in the list, it will leave them as ?.
import re

replacements = ["apple", "banana", "cherry"]
lines = ""
with open("file.txt") as file:
    lines = file.read()

def replace(m):
    # Leave the placeholder untouched once the list is exhausted.
    if not replacements:
        return "?"
    return replacements.pop(0)

lines = re.sub(r"\?", replace, lines)
with open("file.txt", "w") as file:
    file.write(lines)
Admittedly, there are better ways of doing this, such as not loading the entire file into a string.
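For comparison, the same left-to-right substitution could be sketched in Java with Matcher.appendReplacement. The class name PlaceholderFiller and the method fill are made up for this example, and it works on an in-memory string rather than a file. As in the script above, any ? left over once the list is exhausted stays as it is.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the idea; names are not from the answer.
class PlaceholderFiller {
    static String fill(String text, List<String> replacements) {
        // Copy into a queue so the caller's list is not modified.
        Deque<String> queue = new ArrayDeque<>(replacements);
        Matcher m = Pattern.compile("\\?").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String next = queue.isEmpty() ? "?" : queue.poll();
            // quoteReplacement keeps '$' and '\' in the item literal.
            m.appendReplacement(out, Matcher.quoteReplacement(next));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Donec sollicitudin ? malesuada. Quisque velit nisi, ? ut lacinia in, ? id enim.";
        System.out.println(fill(text, Arrays.asList("apple", "banana", "cherry")));
    }
}
```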

You could try doing three regex replacements in succession:
Find:
([^?]*)?\?(.*)
Replace:
$1apple$2
The trick here is that ([^?]*)?\? matches everything up until the first question mark. This allows us to do a controlled replacement of only one ? placeholder at a time.
You would then repeat the above replacement, from left to right, for the other two keywords.

You can use below regex:
\?(?!(.|\s)*\?(.|\s)*)
It will pick the last ? and provide you the index of it. After that you can replace it with the last element of your array (it would be better if you create a stack which contains ["apple", "banana", "cherry"] so that stack.pop method will always give you the last element.)

In Perl:
$text =~ s{\?}{shift @fruits}eg; # Consumes @fruits array
Or
my $i = 0;
$text =~ s{\?}{$fruits[$i++]}g; # Preserves @fruits
To cycle over @fruits (if the number of ?s exceeds the number of fruits):
my $i = 0;
$text =~ s{\?}{$fruits[ $i++ % @fruits ]}g;

Extract all substrings beginning and ending with a regex from large string

I have a large string which contains multiline-substrings between two constant marker-strings, which I can identify with a regex.
For simplification I named them abcdef and fedcba here:
abcdef Sed lobortis nisl sed malesuada bibendum. fedcba
...
abcdef Fusce odio turpis, accumsan non posuere placerat.
1
2
3
fedcba
abcdef Aliquam erat volutpat. Proin ultrices fedcba
How can I get all the occurrences including the markers from the large string?

Something like
Pattern r = Pattern.compile("abcdef[\\s\\S]*?fedcba");
Matcher m = r.matcher(sInput);
while (m.find()) {
    System.out.println("Found value: " + m.group());
}
where sInput is your string to search.
[\s\S]*? will match any number of any character up to the following fedcba. Thanks to the ? it's a non-greedy match, which means it won't continue until the last fedcba (as it would if it was greedy), thus giving you the separate strings.
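The difference between the lazy and the greedy quantifier can be checked with a small sketch like the one below. The class name MarkerExtractor and the method findAll are hypothetical; the markers are the simplified abcdef and fedcba from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; collects every match of a regex into a list.
class MarkerExtractor {
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "abcdef one fedcba\n...\nabcdef two\n1\n2\nfedcba\nabcdef three fedcba";
        // Lazy: stops at the nearest closing marker, one match per block.
        System.out.println(findAll("abcdef[\\s\\S]*?fedcba", input).size());
        // Greedy: runs to the last closing marker, a single large match.
        System.out.println(findAll("abcdef[\\s\\S]*fedcba", input).size());
    }
}
```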

REGEXP:
(?:\babcdef)(?:.*\n)*(?:\bfedcba)
JAVA:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "(?:\\babcdef)(?:.*\\n)*(?:\\bfedcba)";
final String string = "patata\n"
        + "abcdef\n"
        + "Aliquam erat volutpat. Proin ultrices\n"
        + "Testing\n\n"
        + "test[](test)\n"
        + "Testing\n"
        + "fedcba\n"
        + "Testing\n\n\n\n";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}
ORIGINAL TEXT:
patata
abcdef
Aliquam erat volutpat. Proin ultrices
Testing
test[](test)
Testing
fedcba
Testing
RESULT:
abcdef
Aliquam erat volutpat. Proin ultrices
Testing
test[](test)
Testing
fedcba
See: https://regex101.com/r/xXaLgN/5
Enjoy.
Do not forget that if I help you, mark me as the answer to the question.

time complexity Java

The program counts the maximum number of words in a sentence of a given text. A text can have multiple sentences, and I have to find the sentence with the most words.
I have the following code and need to improve its time complexity.
It should not take more than 5 seconds.
import java.util.*;
import java.io.*;
class Solution {
    public int solution(String S) {
        // write your code in Java SE 8
        List<Integer> wca = new ArrayList<Integer>();
        int wc, i;
        String[] sent = S.split("\\.+");
        while (sent.length != 0) {
            for (i = 0; i < sent.length; i++) {
                wc = sent[i].split("\\s+").length;
                wca.add(wc);
            }
        }
        Collections.sort(wca);
        return (wca.get(wca.size() - 1));
    }
}

You don't need to sort the list to simply find its largest value. In fact you don't need a list at all. Simply store the longest sentence as you go along.
public int findLongestSentence(String paragraph) {
    String[] sentences = paragraph.split("\\.|\\!|\\?");
    int maxSentenceLength = 0;
    for (String sentence : sentences) {
        String[] words = sentence.split("\\s");
        maxSentenceLength = Math.max(words.length, maxSentenceLength);
    }
    return maxSentenceLength;
}
This could be made more efficient by not using the split() method, but that would not affect the asymptotic time complexity.
P.S. Informative variable names are important and, along with good code formatting, make your code much easier to read.
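As a rough sketch of what avoiding split() could look like, the text can be scanned once, character by character, counting a word each time a word starts. The class name SentenceStats and the method maxWords are made up for this example, and it assumes that sentences end with '.', '!' or '?'.

```java
// Hypothetical single-pass variant; still O(n), but without temporary arrays.
class SentenceStats {
    static int maxWords(String text) {
        int max = 0;
        int words = 0;          // words seen in the current sentence
        boolean inWord = false; // true while inside a run of non-space characters
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                // Sentence ends: keep the best count and reset.
                max = Math.max(max, words);
                words = 0;
                inWord = false;
            } else if (Character.isWhitespace(c)) {
                inWord = false;
            } else if (!inWord) {
                // First character of a new word.
                inWord = true;
                words++;
            }
        }
        // The last sentence may lack a terminator.
        return Math.max(max, words);
    }

    public static void main(String[] args) {
        System.out.println(maxWords("We test coders. Give us a try? Forget  CVs..Save time . x x"));
    }
}
```

Because a word is counted only when it starts, repeated spaces and leading spaces after a terminator do not inflate the count.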

The program counts maximum number of words in a sentence
Suppose you have this text:
Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Split by dot (.)
arr[0]= Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua
arr[1]= Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat
arr[2]= Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur
arr[3]= Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum
The number of words is related to the number of spaces, so counting the spaces is easier.
int max = 0; // holds the maximal space count
int index = 0; // holds the index of the sentence with the maximal space count
Iterate over the array:
int spaces = arr[i].length() - arr[i].replace(" ", "").length();
if (spaces > max) {
    max = spaces;
    index = i;
}
At the end of the loop, max holds the largest space count, which approximates the word count when words are separated by single spaces, and index holds the position of that sentence in the array.

From what I understand, you want to parse an input text so that you can get the word count of each sentence and find the sentence with the highest one.
First of all, you are only returning the highest word count, with nothing to identify the sentence itself.
Second (as others have already pointed out), the sorting can be replaced by keeping only the longest sentence so far and replacing it whenever a longer one is found. That would indeed bring it to O(n).
Third is the problem that sentences don't only end with periods.
String longest = "";
for (String s : sentences) {
    if (s.split(" ").length > longest.split(" ").length) {
        longest = s;
    }
}
return longest;
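Putting the three points together, a self-contained version might look like the sketch below. The class name LongestSentence and the method find are hypothetical, and the set of sentence terminators ('.', '!', '?') is an assumption.

```java
// Hypothetical helper that returns the sentence itself, not just its word count.
class LongestSentence {
    static String find(String text) {
        String longest = "";
        int maxWords = 0;
        // Split on runs of sentence terminators.
        for (String sentence : text.split("[.!?]+")) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue; // nothing between two terminators
            }
            int words = trimmed.split("\\s+").length;
            if (words > maxWords) {
                maxWords = words;
                longest = trimmed;
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        System.out.println(find("We test coders. Give us a try? Forget CVs!"));
    }
}
```

Storing the word count next to the sentence avoids splitting the current best sentence again on every iteration.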

How to match any word but ignore those that starts with multiple whitespaces?

What I am trying to achieve is to match all words in a text, but ignore the words on any line that starts with 4 whitespace characters.
Example
Text file to find words:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut
enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat.
    This must NOT be matched. Because it has 4 whitespaces at the beginning.
Lorem ipsum dolor sit amet. Ut enim ad minim veniam.
So, the words in the following line should NOT be matched by the pattern:
    This must NOT be matched. Because it has 4 whitespaces at the beginning.
Code
Here is my regex and it can find all words:
\\b[A-Za-z]+\\b
I know that Java's regex syntax has a negation operator, the ^ symbol, but I only know how to use it in simpler expressions.

Maybe the following snippet could be a basis for what you want to achieve.
String[] lines = {"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do",
        "eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut",
        "enim ad minim veniam, quis nostrud exercitation ullamco laboris",
        "nisi ut aliquip ex ea commodo consequat.",
        "",
        "    This must NOT be matched. Because it has 4 whitespaces at the beginning.",
        "",
        "Lorem ipsum dolor sit amet. Ut enim ad minim veniam."};
for (String line : lines) {
    // Skip lines that begin with four spaces.
    if (!line.startsWith("    ")) {
        String[] words = line.split("[\\p{IsPunctuation}\\p{IsWhite_Space}]+");
        System.out.println("words = " + Arrays.toString(words));
    }
}
output
words = [Lorem, ipsum, dolor, sit, amet, consectetur, adipiscing, elit, sed, do]
words = [eiusmod, tempor, incididunt, ut, labore, et, dolore, magna, aliqua, Ut]
words = [enim, ad, minim, veniam, quis, nostrud, exercitation, ullamco, laboris]
words = [nisi, ut, aliquip, ex, ea, commodo, consequat]
words = []
words = []
words = [Lorem, ipsum, dolor, sit, amet, Ut, enim, ad, minim, veniam]
PS: the regex has been borrowed from this answer

The following should do that
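(?<!\\s{4})\\b[A-Za-z]+\\b
It begins with a negative lookbehind so it won't match anything with \s{4} directly preceding it. Note that this only rejects the word immediately after the four whitespace characters; later words on the same line still match.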
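To skip whole lines instead of single words, one option is to test each line first and only then apply the word pattern from the question. The sketch below assumes that "4 whitespaces" means four leading space characters (tabs are not handled); the class name IndentAwareWords and the method words are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper combining a line filter with the asker's word regex.
class IndentAwareWords {
    private static final Pattern WORD = Pattern.compile("\\b[A-Za-z]+\\b");

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        // \R matches any line break sequence (Java 8 and later).
        for (String line : text.split("\\R")) {
            if (line.startsWith("    ")) {
                continue; // ignore every word on an indented line
            }
            Matcher m = WORD.matcher(line);
            while (m.find()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum, dolor.\n    This must NOT be matched.\nUt enim";
        System.out.println(words(text));
    }
}
```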

Regex to extract Content-Type

How can I extract the lines with the Content-Type info? In some mails, these headers can span 2, 3, or even 4 lines, depending on how the mail was sent. This is one example:
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: 7bit
Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit
esse cillum dolore eu fugiat nulla pariatur. Excepteur sint
occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.
I tried this regex: ^(Content-.*:(.|\n)*)* but it grabs everything.
How should I phrase my regex in Java to get only this part:
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: 7bit

Pattern regex = Pattern.compile("^Content-Type(?:.|\\s)*?(?=\n\\s*\n)");
This will match everything which starts with Content-Type until the first completely empty line.

You can try this regex
Pattern regex = Pattern.compile("Content-Type.*?(?=^\\s*\n?\r?$)",
Pattern.DOTALL | Pattern.MULTILINE);

^Content-(.|\n)*\n\n
This will match until the blank line.

Check out the relevant RFCs for the exact definition of headers. IIRC, in essence you need to treat a line break followed by one or more whitespace characters (e.g. space or tab) as a continuation of the same header line. I also believe that you should collapse the line break and the whitespace into a single whitespace character (note: there might be more complex rules, so check the RFCs).
Only if the new line starts directly with a non-whitespace character is it the next header, and if a line break is immediately followed by another line break, the header section ends and the body section starts.
BTW: Why not just use JavaMail instead of reinventing the wheel?
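The folding rule described above could be sketched without a regex as shown below. This is a simplification and not a complete RFC implementation: it assumes continuation lines start with a space or tab, stops at the first empty line, keeps only the last value of a repeated header, and treats header names as case-sensitive. The class name HeaderUnfolder and the method headers are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified header reader; see the caveats in the text.
class HeaderUnfolder {
    static Map<String, String> headers(String message) {
        Map<String, String> result = new LinkedHashMap<>();
        String name = null;
        StringBuilder value = new StringBuilder();
        // Accept both \r\n and bare \n line endings.
        for (String line : message.split("\\r?\\n", -1)) {
            if (line.isEmpty()) {
                break; // blank line: the header section is over
            }
            boolean continuation = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (continuation && name != null) {
                // Folded line: join it to the current header with one space.
                value.append(' ').append(line.trim());
                continue;
            }
            if (name != null) {
                result.put(name, value.toString());
                name = null;
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                name = line.substring(0, colon).trim();
                value = new StringBuilder(line.substring(colon + 1).trim());
            }
        }
        if (name != null) {
            result.put(name, value.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String message = "Content-Type: text/plain;\r\n charset=\"us-ascii\"\r\n"
                + "Content-Transfer-Encoding: 7bit\r\n\r\nLorem ipsum\r\n";
        System.out.println(headers(message).get("Content-Type"));
    }
}
```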

This tested script works for me:
import java.util.regex.*;

public class TEST
{
    public static void main( String[] args )
    {
        String subjectString =
            "Content-Type: text/plain;\r\n" +
            " charset=\"us-ascii\"\r\n" +
            "Content-Transfer-Encoding: 7bit\r\n" +
            "\r\n" +
            "Lorem ipsum dolor sit amet, consectetur adipisicing elit,\r\n" +
            "sed do eiusmod tempor incididunt ut labore et dolore magna\r\n" +
            "aliqua. Ut enim ad minim veniam, quis nostrud exercitation\r\n" +
            "ullamco laboris nisi ut aliquip ex ea commodo consequat.\r\n" +
            "Duis aute irure dolor in reprehenderit in voluptate velit\r\n" +
            "esse cillum dolore eu fugiat nulla pariatur. Excepteur sint\r\n" +
            "occaecat cupidatat non proident, sunt in culpa qui officia\r\n" +
            "deserunt mollit anim id est laborum.\r\n";
        String resultString = null;
        Pattern regexPattern = Pattern.compile(
            "^Content-Type.*?(?=\\r?\\n\\s*\\n)",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE |
            Pattern.UNICODE_CASE | Pattern.MULTILINE);
        Matcher regexMatcher = regexPattern.matcher(subjectString);
        if (regexMatcher.find()) {
            resultString = regexMatcher.group();
        }
        System.out.println(resultString);
    }
}
It works for text with either valid \r\n line terminators or Unix-style \n terminators (invalid here, but commonly used in the wild).
