Remove White Spaces between Specific Substring in a String [duplicate] - java

This question already has answers here:
Which is the best library for XML parsing in java [closed]
(7 answers)
Closed 5 years ago.
What I want is for all the spaces inside the <abc> tag to be removed, while the spaces inside the <efg> tag are kept:
<abc>this is between abc</abc><efg>this is between efg</efg>
<efg>this is between efg</efg><abc>this is between abc</abc>
The output I want:
<abc>thisisbetweenabc</abc><efg>this is between efg</efg>
<efg>this is between efg</efg><abc>thisisbetweenabc</abc>
I tried string = string.replaceAll("<abc> </abc>", ""); but it's not working for me.

Brief
I urge you to use an XML parser!!! Anyway, if it's a limited, known set of HTML, you can use the following regex (as per my original comment).
Note: This solution only works on a limited, known set of HTML. If your input differs from what you posted in your question, it is likely this solution will not work. See Pshemo's comment below your question.
Note 2: The OP changed the format of the input, thus my original answer will no longer work. See original input below. (Exactly why I put a limited, known set of HTML). In the Code section I've added a second regex that works on the OP's newly added input.
Code
See regex in use here
(?:^(<abc>)|\G(?!^))(\S+)[ \t]*
Replace with $1$2
With the new input format, the following regex can be used (as seen in use here):
(?:^(<abc>)|\G(?!^))([^\s<]+)[ \t]*
Results
Input
<abc>this is between abc</abc>
<efg>this is between efg</efg>
<abc>this is between abc</abc>
<efg>this is between efg</efg>
Output
<abc>thisisbetweenabc</abc>
<efg>this is between efg</efg>
<abc>thisisbetweenabc</abc>
<efg>this is between efg</efg>
Explanation
(?:^(<abc>)|\G(?!^)) Match either of the following
^(<abc>) Match the following
^ Assert position at the start of the line
(<abc>) Capture <abc> literally into capture group 1
\G(?!^) Assert position at the end of the previous match
(\S+) Capture any non-whitespace character one or more times into capture group 2
[ \t]* Match space or tab characters any number of times
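In Java, the second pattern could be applied roughly as follows. This is a minimal sketch, assuming one tag per line as in the Results above; the class and method names are made up for illustration, and the (?m) flag is an addition so that ^ matches at the start of every line rather than only at the start of the input.

```java
public class AbcLineCleaner {
    // (?m) lets ^ match at each line start; \G resumes where the previous match ended.
    private static final String REGEX = "(?m)(?:^(<abc>)|\\G(?!^))([^\\s<]+)[ \\t]*";

    // Hypothetical helper: drops spaces and tabs in lines that start with <abc>.
    public static String removeSpacesInAbcLines(String input) {
        // $1 is "<abc>" for the first match on a line and empty for the continuation matches.
        return input.replaceAll(REGEX, "$1$2");
    }

    public static void main(String[] args) {
        String input = "<abc>this is between abc</abc>\n<efg>this is between efg</efg>";
        System.out.println(removeSpacesInAbcLines(input));
    }
}
```

Because of the ^ anchor, an <abc> tag that does not start its line is left untouched, which is one more reason to treat this as a fix for a known input format only.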

Simply do this:
String xml = "<abc>this is between abc</abc><efg>this is between efg</efg>"; // overall string with <abc> and </abc>
int start = xml.indexOf("<abc>");
int end = xml.indexOf("</abc>");
// substring takes a start index and an end index (not a length)
String abcOnly = xml.substring(start, end);
abcOnly = abcOnly.replace(" ", "");
// splice the cleaned part back into the original string
xml = xml.substring(0, start) + abcOnly + xml.substring(end);
This is only a rough sketch that handles the first <abc>...</abc> pair, but you can easily adapt it. I am not in front of your code to test it, but you should be able to get what you need from this.
Disclaimer: Using an XML parser is a far better way to handle this than manipulating strings, but I'll assume you have your reasons, so I'll answer the question you asked instead of telling you to go get an XML parser lol. Good luck.
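If the string can contain several <abc> elements in any position, as in the question's input, one option is to let a regex find each pair and clean only the captured text. The following is a sketch under the assumption that <abc> tags are never nested and carry no attributes; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AbcSpaceStripper {
    // Lazily capture everything between an opening <abc> and the nearest closing </abc>.
    private static final Pattern ABC = Pattern.compile("<abc>(.*?)</abc>", Pattern.DOTALL);

    public static String stripSpacesInsideAbc(String input) {
        Matcher m = ABC.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String inner = m.group(1).replace(" ", "");
            // quoteReplacement keeps any $ or \ in the text from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement("<abc>" + inner + "</abc>"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripSpacesInsideAbc(
                "<efg>this is between efg</efg><abc>this is between abc</abc>"));
    }
}
```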

Related

Regular expression to split text on space and newline? [duplicate]

This question already has answers here:
Regex match empty lines
(2 answers)
Closed 4 years ago.
I have a String that looks as follows:
This is line number 1.
[space][space][space][space]\n
[space]\n
This is line number 2.
where every [space] represents a blank space and \n represents a new line.
What I would like to do is to split this string into two strings, one that has "This is line number 1." and the other that contains "This is line number 2." In other words split the string on every two empty lines regardless of whether they contain spaces or not.
What I tried to do:
System.out.println(myString.split("^[ ]{0,}\\n")[0]);
But the above prints the whole string.
UPDATE
Other things I have tried that also print the whole string and don't seem to work:
System.out.println(myString.split("(^[ ]{0,}\\n){2,}")[0]);
These all print the whole string as well. Any ideas?
Simply enable the multiline flag in your pattern like this.
myString.split("(?m)^[ ]{0,}\n")
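The (?m) prefix is an embedded flag that turns on multiline mode, so ^ matches at the start of every line instead of only at the start of the input. It lets you set the flag inline without compiling a Pattern with Pattern.MULTILINE yourself.
This should work, though I'm not sure whether you will get an extra split caused by the first blank line.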
I'm just briefly looking through things on a break at work so perhaps I haven't read the question thoroughly enough, but have you tried:
System.out.println(parsedText.split("^[ ]{0,}\n\n")[0]);
Seems like you are not completely skipping two lines in your code. Once again might be wrong but worth a shot!
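Another way to approach the question is to put the newline that ends the text line into the separator, followed by two or more lines that hold only spaces. This is a sketch rather than a tested drop-in, and it assumes lines end in \n (not \r\n); the class and method names are illustrative.

```java
public class BlankLineSplitter {
    // The newline that ends a text line, then two or more lines containing only spaces.
    private static final String SEPARATOR = "\\n(?:[ ]*\\n){2,}";

    public static String[] splitOnBlankLines(String text) {
        return text.split(SEPARATOR);
    }

    public static void main(String[] args) {
        String text = "This is line number 1.\n    \n \nThis is line number 2.";
        for (String part : splitOnBlankLines(text)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Since the separator starts at a literal newline, no multiline flag is needed, and a single blank line does not trigger a split.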

Regex matches only part of a URL - why?

I am very weak in regex, and the regex I am using (found on the internet) only partially solves my problem. I need to add an anchor tag to a URL from text input using Java. Here is my code:
String text ="Hi please visit www.google.com";
String reg = "\\b(([\\w-]+://?|www[.])[^\\s()<>]+(?:\\([\\w\\d]+\\)|([^[:punct:]\\s]|/)))";
String s = text.replaceAll(reg, "<a href='$1'>$1</a>");
System.out.println(""+s);
The output currently is Hi please visit <a href='www.google.c'>www.google.c</a>om. What's wrong with the regex?
I need to parse a text and display a URL entered from text field as hot link in a jsp page. The actual output expected would be
Hi please visit <a href='www.google.com'>www.google.com</a>
Edit
Following regex
(http(s)?://)?(www(\.\w+)+[^\s.,"']*)
works like a charm for URLs ending in .com but fails for other extensions like .jsp. Is there any way to make it work for all sorts of extensions?
To answer your question why the regex doesn't work: It doesn't observe Java's regex syntax rules.
Specifically:
[^[:punct:]\s]
doesn't work as you expect it to because Java doesn't recognize POSIX shorthands like [:punct:]. Instead, it treats that as a nested character class. That again leads to the ^ becoming illegal in that context, so Java ignores it, leaving you with a character class that matches the same as
[:punct\s]
which only matches the c of com, therefore ending your match there.
As for your question of how to find URLs in a block of text, I suggest you read Jan Goyvaerts' excellent blog entry Detecting URLs in a block of text. You'll need to decide for yourself how sensitive and how specific you want to make your regex.
For example, the solution proposed at the end of the post would translate to Java as
String resultString = subjectString.replaceAll(
"(?imx)\\b(?:(?:https?|ftp|file)://|www\\.|ftp\\.)\n" +
"(?:\\([-A-Z0-9+&#\\#/%=~_|$?!:,.]*\\)|\n" +
" [-A-Z0-9+&#\\#/%=~_|$?!:,.])*\n" +
"(?:\\([-A-Z0-9+&#\\#/%=~_|$?!:,.]*\\)|\n" +
" [A-Z0-9+&#\\#/%=~_|$])", "$0");
Java recognises posix expressions (see javadoc), but the syntax is a little different. It looks like this instead:
\p{Punct}
But I would simplify your regex for a URL to:
(?i)(http(s)?://)?((www(\.\w+)+|(\d{1,3}\.){3}\d{1,3})[^\s,"']*(?<!\.))
And elaborate it only if you find a test case that breaks it.
As a line of Java it would be:
text = text.replaceAll("(?i)(http(s)?://)?((www(\\.\\w+)+|(\\d{1,3}\\.){3}\\d{1,3})[^\\s,\"']*(?<!\\.))", "$3");
Note the neat capture of the "s" in "https" (if found) that is restored if required.
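To produce the anchor tag the question asks for, the simplified pattern can be combined with a replacement that rebuilds the scheme from group 2. The sketch below is an assumption about how the pieces fit together, not the answerer's exact code: the replacement string, class name, and method name are made up for illustration.

```java
public class UrlLinker {
    // Group 2 is the optional "s" of https; group 3 is the host plus any path.
    private static final String URL_REGEX =
            "(?i)(http(s)?://)?((www(\\.\\w+)+|(\\d{1,3}\\.){3}\\d{1,3})[^\\s,\"']*(?<!\\.))";

    public static String linkify(String text) {
        // $2 expands to "s" only when the input URL used https; otherwise it is empty.
        return text.replaceAll(URL_REGEX, "<a href='http$2://$3'>$3</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("Hi please visit www.google.com"));
        System.out.println(linkify("See https://www.example.com/index.jsp."));
    }
}
```

The trailing lookbehind (?<!\.) keeps a sentence-ending period out of the link, so the .jsp case from the question's edit is covered as well.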

Regular expression, excluding .. in suffix of email addy [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Using a regular expression to validate an email address
This is homework, I've been working on it for a while, I've done lots of reading and feel I have gotten pretty familiar with regex for a beginner.
I am trying to find a regular expression for validating/invalidating a list of emails. There are two addresses which are giving me problems, I can't get them both to validate the correct way at the same time. I've gone through a dozen different expressions that work for all the other emails on the list but I can't get those two at the same time.
First, the addresses.
me@example..com - invalid
someone.nothere@1.0.0.127 - valid
The part of my expression which validates the suffix
I originally started with
@.+\\.[[a-z]0-9]+
I also had a second pattern for checking some more invalid addresses and checked each email against both patterns, one for validity and the other for invalidity, but my professor said he wanted it all in one expression.
@[[\\w]+\\.[\\w]+]+
or
@[\\w]+\\.[\\w]+
I've tried it written many, many different ways but I'm pretty sure I was just using different syntax to express these two expressions.
I know what I want it to do, I want it to match a character class of "character+"."character+"+
The plus sign means at least one. It works for the invalid address when I only allow the character class to repeat one time (and obviously the IP doesn't get matched), but when I allow the character class to repeat itself it matches the second period even though it isn't preceded by a character. I don't understand why.
I've even tried grouping everything with () and putting {1} after the escaped . and changing the \w to a-z and replacing + with {1,}; nothing seems to require the period to be surrounded by characters.
You need a negative look-ahead:
@\w+\.(?!\.)
See http://www.regular-expressions.info/lookaround.html
Test in Perl:
Perl> $_ = 'someone.nothere@1.0.0.127'
someone.nothere@1.0.0.127
Perl> print "OK\n" if /\@\w+\.(?!\.)/
OK
1
Perl> $_ = 'me@example..com'
me@example..com
Perl> print "OK\n" if /\@\w+\.(?!\.)/
Perl>
@([\\w]+\\.)+[\\w]+
Matches at least one word character, followed by a '.'. This is repeated at least once, and is then followed by at least one more word character.
I think you want this:
@[\\w]+(\\.[\\w]+)+
This matches a "word" followed by one or more "." "word" sequences. (You can also do the grouping the other way around; e.g. see Dailin's answer.)
The problem with what you were doing before was that you were trying to embed a repeat inside a character class. That doesn't make sense, and there is no syntax that would support it. A character class defines a set of characters and matches against one character. Nothing more.
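The "word followed by one or more dot-word sequences" idea can be checked against the two problem addresses with a short Java sketch. It assumes the addresses use the usual @ separator, and the local-part pattern [\w.]+ is only a placeholder assumption, since the discussion above concerns the suffix; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

public class EmailSuffixCheck {
    // Local part (placeholder assumption), then @, a word, and one or more ".word" groups.
    private static final Pattern EMAIL = Pattern.compile("[\\w.]+@\\w+(\\.\\w+)+");

    public static boolean isValid(String address) {
        // matches() requires the whole string to fit, so "example..com" is rejected:
        // every '.' in the suffix must be directly followed by a word character.
        return EMAIL.matcher(address).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("me@example..com"));
        System.out.println(isValid("someone.nothere@1.0.0.127"));
    }
}
```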
The official standard, RFC 2822, describes the syntax that valid email addresses must adhere to. It can be implemented with this regular expression:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
A more practical implementation of RFC 2822 (if we omit the syntax using double quotes and square brackets), which will still match 99.99% of all email addresses in actual use today, is:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

is it possible to use replaceAll() with wildcards

Good morning. I realize there are a ton of questions out there regarding replace() and replaceAll(), but I haven't seen this one.
What I'm looking to do is parse a string (which contains valid HTML, to a point). After I see the second instance of <p> in the string, I want to remove everything that starts with & and ends with ; until I see the next </p>.
To do the second part I was hoping to use something along the lines of s.replaceAll("&*;","")
That doesn't work, but hopefully it gets my point across: I am looking to replace anything that starts with & and ends with ;
You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee you'll have to do this to find text within the <p> tags.
For the replacement logic, String.replaceAll uses regular expressions, which can do the matching you want.
The "wildcard" in regular expressions that you want is the .* expression. Using your example:
String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);
This outputs This String. This is because the . represents any character, and the * means "this character 0 or more times." So .* basically means "any number of characters." However, feeding it:
"This &escape;String &anotherescape;Extended"
will probably not do what you want, and it will output This Extended. To fix this, you specify exactly what you want to look for instead of the . character. This is done using [^;], which means "any character that's not a semicolon":
String removed = ampStr.replaceAll("&[^;]*;", "");
This has performance benefits over &.*?; for non-matching strings, so I highly recommend using this version, especially since not all HTML files will contain a &abc; token and the &.*?; version can become a huge performance bottleneck as a result.
The expression you want is:
s.replaceAll("&.*?;","");
But do you really want to be parsing HTML this way? You may be better off using an XML parser.
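If a parser is not an option, the "second <p> until the next </p>" part of the question can be approximated with indexOf before applying the replacement. This is a sketch under the assumption that the tags appear literally as <p> and </p> with no attributes; the class and method names are hypothetical.

```java
public class EntityStripper {
    public static String stripEntitiesInSecondParagraph(String html) {
        int first = html.indexOf("<p>");
        if (first < 0) return html;
        // Look for the second <p> after the end of the first one.
        int second = html.indexOf("<p>", first + "<p>".length());
        if (second < 0) return html;
        int close = html.indexOf("</p>", second);
        if (close < 0) close = html.length();
        // Only the slice between the second <p> and its </p> is cleaned.
        String cleaned = html.substring(second, close).replaceAll("&[^;]*;", "");
        return html.substring(0, second) + cleaned + html.substring(close);
    }

    public static void main(String[] args) {
        System.out.println(stripEntitiesInSecondParagraph(
                "<p>a&amp;b</p><p>c&nbsp;d&lt;e</p><p>f&gt;g</p>"));
    }
}
```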

Regex JAVA return a chain between delimiters [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
I'm trying to retrieve a string being between delimiters...
Here is the sample:
<TAG> x1 x2 y1 y2 </TAG>
I want my regex to return TAG
Could you also provide a link to good regex documentation, please?
What you are doing is possibly okay, as long as the tags are not recursive, otherwise it is not a good idea to do so! (a funny read).
If you are trying to write regex to get something in between those tags, and if that is the only exact case you wish to handle:
You need to capture the name in tag 1. See this - this is done by enclosing in parentheses.
regex = "<(.*?)>".
The question mark is to make sure the shortest string (non-greedy) is matched - which is TAG in your case. If you just give <.*> it matches the whole expression, since regular expressions by default tend to match the longest string. The parentheses store the tag name so that it can be used in step 2.
Then you need to make sure the closing tag has the same one - which needs a back reference to the captured group.(See this). So, here you need to write:
regex = "<(.*?)>.*</\\1>"
The \1 is the back reference to the expression captured in the first set of parentheses. Inside a Java string literal the backslash must be doubled, hence \\1.
I did not test it myself, but it should give you an idea of the concepts you need to use to write such an expression.
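The two steps can be tried out in Java along these lines. This is a sketch assuming a single, non-nested tag pair as in the sample; the second capture group for the inner text, along with the class and method names, is an addition for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagExtractor {
    // Group 1: tag name (lazy). Group 2: text up to the closing tag with the same name.
    private static final Pattern TAGGED = Pattern.compile("<(.*?)>(.*)</\\1>");

    /** Returns {tagName, innerText}, or null when no matching pair is found. */
    public static String[] extract(String input) {
        Matcher m = TAGGED.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2).trim() };
    }

    public static void main(String[] args) {
        String[] result = extract("<TAG> x1 x2 y1 y2 </TAG>");
        System.out.println(result[0] + " -> " + result[1]);
    }
}
```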
