Removing all html markup

I have a string that holds a complete XML get request.
In the request, there is a lot of HTML and some custom commands which I would like to remove.
The only way of doing so I know is by using jSoup.
For example like so.
Now, because the website the request came from also features custom commands, I was not able to completely remove all code.
For example here is a string I would like to 'clean':
\u0027s normal text here\u003c/b\u003e http://a_random_link_here.com\r\n\r\nSome more text here
As you can see, the custom commands all have backslashes in front of them.
How would I go about removing these commands with Java?
If I use regex, how can I program it such that it only removes the command, not anything after the command?
(because if I softcode: I don't know the size of the command beforehand and I don't want to hardcode all the commands).

See http://regex101.com/r/gJ2yN2
The regex (\\.\d{3,}.*?\s|(\\r|\\n)+) works to remove the things you were pointing out.
Result (replacing the match with a single space):
normal text here http://a_random_link_here.com Some more text here
If this was not the result you were looking for, please edit your question with the expected result.
EDIT regex explained:
() - match everything inside the parentheses (later, the "match" gets replaced with "space")
\\ - an 'escaped' backslash (i.e. an actual backslash; the first one "protects" the second so it is not interpreted as a special character)
. - any character (I saw 'u', but there might be others)
\d - a digit
{3,} - "at least three"
.*? - any characters, "lazy" (stop as soon as possible)
\s - until you hit a white space
| - or
() - one of these things
\\r - backslash - r (again, with escaped '\')
\\n - backslash - n

The "custom commands" you're showing us appear to be standard character escapes. \r is carriage return, ASCII 13 (decimal). \n is new line, ASCII 10 (decimal). \uxxxx is generally an escape for the Unicode character with that hex value -- for example, \u0027 is ASCII character 39, the apostrophe character ('). You don't want to discard these; they're part of the text content you're trying to retrieve.
So the best answer is to find out which escapes can occur in this dataset, and then find or write code that does a quick linear scan through the text looking for \. When it finds one, it uses the next character to determine which kind of escape it is (and how many subsequent characters belong to it), replaces the escape sequence with the single character it represents, and continues until it reaches the end of the string, buffer, or file.
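The linear scan described above could look roughly like the sketch below. The class name EscapeDecoder and its methods are hypothetical, and it assumes the only escapes in the data are backslash-r, backslash-n and backslash-u followed by four hex digits; anything else is copied through unchanged, which is just one possible policy.

```java
// Hypothetical helper, not from the original answer. It decodes only the
// escapes visible in the question: backslash-r, backslash-n, and
// backslash-u followed by exactly four hex digits.
final class EscapeDecoder {

    static String decode(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            // Ordinary character, or a backslash with nothing after it: copy it.
            if (c != '\\' || i + 1 >= in.length()) {
                out.append(c);
                i++;
                continue;
            }
            char kind = in.charAt(i + 1);
            if (kind == 'r') {
                out.append('\r');
                i += 2;
            } else if (kind == 'n') {
                out.append('\n');
                i += 2;
            } else if (kind == 'u' && i + 6 <= in.length() && isHex(in, i + 2, i + 6)) {
                // Four hex digits give the code unit the escape stands for.
                out.append((char) Integer.parseInt(in.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                // Unknown escape: keep the backslash as-is (an assumption).
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    private static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslashes make these literal backslash sequences.
        String raw = "\\u0027s normal text here\\u003c/b\\u003e\\r\\n\\r\\nSome more text here";
        System.out.println(decode(raw));
    }
}
```

Once the escapes are decoded, what remains (for example the closing b tag in the sample) is ordinary HTML, which an HTML parser such as jsoup can strip.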

Related

Using regular expression, how to remove matching sequence at the beginning and ending of the text but keeping what's in the middle?

my problem is very simple but I can't figure out the correct regular expression I should use.
I have the following variable (Java) :
String text = "\033[1mYO\033[0m"; // this is ANSI for bold text in the Terminal
My goal is to remove the ANSI codes with a single regular expression (I just want to keep the plain text at the middle). I cannot modify the text in any way and those ANSI codes will always be at the same place (so one at the beginning, one at the end, though sometimes it's possible that there is none).
With this regular expression, I will remove them using replaceAll method :
String plainText = text.replaceAll(unknownRegex, "");
Any idea on what the unknown regex could be?

Well, you use a single regex that has the ansi codes optionally at the beginning and end, captures anything in between and replaces the entire string with the value of the group: text.replaceAll("^(?:\\\\\\d+\\[1m)?(.*?)(?:\\\\\\d+\\[0m)?$", "$1"). (this might not capture every ansi code - adjust if needed).
Breaking the expression down (note that the example above escapes backslashes for Java strings so they are doubled):
^ is the start of the string
(?:\\\d+\[1m)? matches an optional \<at least 1 digit>[1m
(.*?) matches any text but as little as possible, and captures it into group 1
(?:\\\d+\[0m)? matches an optional \<at least 1 digit>[0m
$ is the end of the input
In the replacement $1 refers to the value of capturing group 1 which is (.*?) in the expression.

Found the answer thanks to a comment that disappeared.
Actually, I just needed a group to capture what's in the middle of the string, and then to use it ($1) to replace the whole thing:
String plainText = text.replaceAll("\\033\\[.*m(.+)\\033\\[.*m", "$1");
Not sure if this will remove every ANSI code, but that is enough for what I want to do.
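One thing worth keeping in mind: in a Java string literal, \033 is an octal escape, so the text in the question contains the actual ESC character (code 27), not a backslash followed by digits. A pattern therefore has to match ESC itself. The sketch below is one possible way to do that; the class name AnsiStripper is made up, and it only covers the colour/style sequences that end in m, not every ANSI control sequence.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes ANSI style sequences of the form
// ESC '[' digits-and-semicolons 'm'. Other ANSI sequences are left alone.
final class AnsiStripper {

    // \x1B in the regex is the ESC character (decimal 27).
    private static final Pattern STYLE = Pattern.compile("\\x1B\\[[0-9;]*m");

    static String strip(String text) {
        return STYLE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "\033[1mYO\033[0m";
        System.out.println(strip(text));
    }
}
```

Because each code is removed wherever it appears, this also works when the codes are missing or sit somewhere other than the very start and end.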

Java Regex with "Joker" characters

I'm trying to write a regex that validates an input field.
What I call "joker" chars are '?' and '*'.
Here is my Java regex:
"^$|[^\\*\\s]{2,}|[^\\*\\s]{2,}[\\*\\?]|[^\\*\\s]{2,}[\\?]{1,}[^\\s\\*]*[\\*]{0,1}"
What I'm trying to match is:
Minimum 2 alpha-numeric characters (other than '?' and '*')
The '*' can appear only once, and only at the end of the string
The '?' can appear multiple times
No whitespace at all
So for example :
abcd = OK
?bcd = OK
ab?? = OK
ab* = OK
ab?* = OK
??cd = OK
*ab = NOT OK
??? = NOT OK
ab cd = NOT OK
" abcd" = NOT OK (space at the beginning)
I've made the regex a bit complicated and I'm lost. Can you help me?

^(?:\?*[a-zA-Z\d]\?*){2,}\*?$
Explanation:
The regex asserts that this pattern must appear twice or more:
\?*[a-zA-Z\d]\?*
which asserts that there must be one character in the class [a-zA-Z\d] with zero or more question marks on the left or right of it.
Then, the regex matches \*?, which means 0 or 1 asterisk characters, at the end of the string.
Demo
Here is an alternative regex that is faster, as revo suggested in the comments:
^(?:\?*[a-zA-Z\d]){2}[a-zA-Z\d?]*\*?$
Demo
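To try the first pattern from Java, the backslashes have to be doubled inside the string literal. Here is a small sketch that runs it against the examples from the question; the class name JokerValidator is only illustrative.

```java
import java.util.regex.Pattern;

// Illustrative wrapper around the first regex from this answer.
final class JokerValidator {

    // Same pattern as above, with backslashes doubled for a Java string.
    private static final Pattern VALID =
            Pattern.compile("^(?:\\?*[a-zA-Z\\d]\\?*){2,}\\*?$");

    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "abcd", "?bcd", "ab??", "ab*", "ab?*", "??cd",
            "*ab", "???", "ab cd", " abcd"
        };
        for (String s : samples) {
            System.out.println("\"" + s + "\" = " + (isValid(s) ? "OK" : "NOT OK"));
        }
    }
}
```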

Here you go:
^\?*\w{2,}\?*\*?(?<!\s)$
Both described and demonstrated at Regex101.
^ is a start of the String
\?* indicates any number of initial ? characters (must be escaped)
\w{2,} at least 2 alphanumeric characters
\?* continues with any number of ? characters
\*? and optionally one last * character
(?<!\s) asserts that the character just before the end of the String is not a whitespace character (using negative look-behind)
$ is an end of the String

Another way to solve this problem is with the look-ahead mechanism (?=subregex). It is zero-length (it resets the regex cursor to the position it had before executing subregex), so it lets the regex engine run multiple tests on the same text via the construct
(?=condition1)
(?=condition2)
(?=...)
conditionN
Note: the last condition (conditionN) is not placed in (?=...) so that the regex engine can move the cursor past the tested part (to "consume" it) and go on to test whatever follows. For this to work, conditionN must match precisely the section we want to "consume" (earlier conditions don't have that limitation; they can match substrings of any length, such as just the first few characters).
So now we need to work out what our conditions are.
We want to match only alphanumeric characters, ? and *, where * can appear (optionally) only at the end. We can write this as ^[a-zA-Z0-9?]*[*]?$. This also rules out whitespace, because we didn't include it among the accepted characters.
The second requirement is to have "Minimum 2 alpha-numeric characters". It can be written as .*?[a-zA-Z0-9].*?[a-zA-Z0-9] or (?:.*?[a-zA-Z0-9]){2,} (if we like shorter regexes). Since that condition doesn't test the whole text but only part of it, we can place it in a look-ahead.
These conditions seem to cover everything we wanted, so we can combine them into a regex that looks like:
^(?=(?:.*?[a-zA-Z0-9]){2,})[a-zA-Z0-9?]*[*]?$
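To see that the look-ahead really just bundles two independent checks, the sketch below tests the conditions separately and then with the combined regex. The class and method names are invented for the illustration.

```java
import java.util.regex.Pattern;

// Illustration only: the two conditions checked one by one, and then
// combined into a single pattern with a look-ahead.
final class LookaheadDemo {

    // Condition 1: only alphanumerics and '?', with an optional '*' at the end.
    private static final Pattern SHAPE = Pattern.compile("^[a-zA-Z0-9?]*[*]?$");

    // Condition 2: at least two alphanumeric characters somewhere.
    private static final Pattern TWO_ALNUM = Pattern.compile("(?:.*?[a-zA-Z0-9]){2,}");

    // Both conditions in one regex; condition 2 sits in the look-ahead.
    private static final Pattern COMBINED =
            Pattern.compile("^(?=(?:.*?[a-zA-Z0-9]){2,})[a-zA-Z0-9?]*[*]?$");

    static boolean twoChecks(String s) {
        return SHAPE.matcher(s).matches() && TWO_ALNUM.matcher(s).find();
    }

    static boolean oneRegex(String s) {
        return COMBINED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abcd", "a?b", "ab?*", "*ab", "???", "ab cd"}) {
            System.out.println("\"" + s + "\": " + twoChecks(s) + " / " + oneRegex(s));
        }
    }
}
```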

Java - Regex Replace All will not replace matched text

Trying to remove a lot of Unicode escapes from a string, but having issues with regex in Java.
Example text:
\u2605 StatTrak\u2122 Shadow Daggers
Example Desired Result:
StatTrak Shadow Daggers
The current regex code I have that will not work:
list.replaceAll("\\\\u[0-9]+","");
The code will execute but the text will not be replaced. From looking at other solutions people seem to use only two "\\" but anything less than 4 throws me the typical error:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 2
\u[0-9]+
I've tried the current regex solution in online test environments like RegexPlanet and FreeFormatter and both give the correct result.
Any help would be appreciated.

Assuming that you would like to replace these "special strings" with an empty String: as I see it, \u2605 and \u2122 are characters outside the POSIX character class \p{Print}, which in Java covers only printable ASCII by default. That's why we can replace everything that is not in that class with "". Then the result is what you expect (apart from the leading space, which trim() removes).
A sample would be:
list = list.replaceAll("\\P{Print}", "");
Hope this helps.

In Java, something like your \u2605 is not a literal sequence of six characters, it represents a single unicode character — therefore your pattern "\\\\u[0-9]{4}" will not match it.
Your pattern describes a literal character \ followed by the character u followed by exactly four numeric characters 0 through 9 but what is in your string is the single character from the unicode code point 2605, the "Black Star" character.
This works just like other escape sequences: in the string "some\tmore" there is no character \ and no character t; there is only the single character 0x09, a tab. Because \t is an escape sequence known to Java (and other languages), it gets replaced by the character it represents, and the literal \ and t are no longer characters in the string.
Kenny Tai Huynh's answer, replacing non-printables, may be the easiest way to go, depending on what sorts of things you want removed, or you could list the characters you want (if that is a very limited set) and remove the complement of those, such as mystring.replaceAll("[^A-Za-z0-9]", "");
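The difference between the two cases is easy to check in code. In the sketch below (class and method names are made up), the first string contains the single Black Star character because the compiler translates the escape, while the second contains six literal characters because the backslash is doubled. Which removal pattern works depends entirely on which of the two your data actually holds.

```java
// Illustration of "one character" versus "six literal characters".
final class UnicodeEscapeDemo {

    // Removes literal backslash + 'u' + four hex digits.
    static String removeLiteralEscapes(String s) {
        return s.replaceAll("\\\\u[0-9a-fA-F]{4}", "");
    }

    // Removes every character that is not printable ASCII.
    static String removeNonPrintableAscii(String s) {
        return s.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        String decoded = "\u2605 StatTrak";   // escape translated by the compiler
        String literal = "\\u2605 StatTrak";  // backslash, 'u', '2', '6', '0', '5', ...

        System.out.println(decoded.length()); // 10
        System.out.println(literal.length()); // 15

        // The literal-escape pattern finds nothing in the decoded string.
        System.out.println(removeLiteralEscapes(decoded).equals(decoded)); // true
        System.out.println(removeLiteralEscapes(literal));
        System.out.println(removeNonPrintableAscii(decoded));
    }
}
```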

I'm an idiot. I was calling the replaceAll on the string but not assigning it as I thought it altered the string anyway.
What I had previously:
list.replaceAll("\\\\u[0-9]+","");
What I needed:
list = list.replaceAll("\\\\u[0-9]+","");
Result works fine now, thanks for the help.
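For anyone hitting the same thing: Java strings are immutable, so replaceAll returns a new String and leaves the original untouched. A minimal sketch of the two variants (the class and method names are only for illustration):

```java
// Shows why the result of replaceAll has to be assigned.
final class ImmutableStringDemo {

    static String withoutAssignment(String list) {
        list.replaceAll("\\\\u[0-9]+", ""); // result is thrown away
        return list;                         // still the original text
    }

    static String withAssignment(String list) {
        list = list.replaceAll("\\\\u[0-9]+", ""); // keep the new String
        return list;
    }

    public static void main(String[] args) {
        // Doubled backslashes: the text holds literal backslash-u sequences.
        String list = "\\u2605 StatTrak\\u2122 Shadow Daggers";
        System.out.println(withoutAssignment(list));
        System.out.println(withAssignment(list).trim());
    }
}
```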

Regular expression to return results that do not match selection

I work on a product that provides a Java API to extend it.
The API provides a function which
takes a Perl regular expression and
returns a list of matching files.
I want to filter the list to remove all files that end in .xml, .xsl and .cfg; basically the opposite of .*(\.xml|\.xsl|\.cfg).
I have been searching but I haven't been able to get anything to work yet.
I tried .*(?!\.cfg) and ^((?!cfg).)*$ and \.(?!cfg$|?!xml$|?!xsl$).
I don't know if I am on the right track or not.
Note
I know the regex systems are similar, but I can't get a Java regex working either.

You may use
^(?!.*\.(x[ms]l|cfg)$).+
See the regex demo
Details:
^ - start of a string
(?!.*\.(x[ms]l|cfg)$) - a negative lookahead that fails the match if any 0+ chars other than line break chars (.*) are followed with xml, xsl or cfg ((x[ms]l|cfg)) at the end of the string ($)
.+ - any 1 or more chars other than linebreak chars. Might be omitted if the entire string match is not required (in some tools it is required though).
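Since the question mentions not getting a Java regex to work either, here is a rough sketch of the same pattern used with java.util.regex to filter a list of names. The product's own API is not shown in the question, so the class ExtensionFilter and the file names are purely hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical stand-in for the product API: keeps every name that does
// NOT end in .xml, .xsl or .cfg.
final class ExtensionFilter {

    private static final Pattern KEEP = Pattern.compile("^(?!.*\\.(x[ms]l|cfg)$).+");

    static List<String> keep(List<String> names) {
        return names.stream()
                .filter(name -> KEEP.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a.xml", "b.txt", "c.cfg", "d.xsl", "e.xml.bak");
        System.out.println(keep(names));
    }
}
```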

You need something like this, which matches only if the end of the string isn't preceded by a dot and one of the three unwanted types
/(?<!\.(?:xml|xsl|cfg))\z/

Take string "asdxyz\n" and replace \n with the ascii value

I don't want to replace it with /u000A; I do NOT want it to look like "asdxyz/u000A".
I want to replace it with the actual newline CHARACTER.

Based on your response to Natso's answer, it seems that you have a fundamental misunderstanding of what's going on. The two-character sequence \n isn't the new-line character. But it's the way we represent that character in code because the actual character is hard to see. The compiler knows that when it encounters that two-character sequence in the context of a string literal, it should interpret them as the real new-line character, which has the ASCII value 10.
If you print that two-character sequence to the console, you won't see them. Instead, you'll see the cursor advance to the next line. That's because the compiler has already replaced those two characters with the new-line character, so it's really the new-line character that got sent to the console.
If you have input to your program that contains backslashes and lowercase N's, and you want to convert them to new-line characters, then Zach's answer might be sufficient. But if you want your program to allow real backslashes in the input, then you'll need some way for the input to indicate that a backslash followed by a lowercase N is really supposed to be those two characters. The usual way to do that is to prefix the backslash with another backslash, escaping it. If you use Zach's code in that situation, you may end up turning the three-character sequence \\n into the two-character sequence consisting of a backslash followed by a new-line character.
The sure-fire way to read strings that use backslash escaping is to parse them one character at a time, starting from the beginning of the input. Copy characters from the input to the output, except when you encounter a backslash. In that case, check what the next character is, too. If it's another backslash, copy a single backslash to the output. If it's a lowercase N, then write a new-line character to the output. If it's any other character, do whatever you define to be the right thing. (Examples include rejecting the whole input as erroneous, pretending the backslash wasn't there, and omitting both the backslash and the following character.)
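A minimal sketch of that character-by-character parse might look like this. The class name BackslashUnescaper is hypothetical, and rejecting unknown escapes with an exception is just one of the policies listed above, chosen here for brevity.

```java
// Hypothetical helper: understands only backslash-backslash and backslash-n.
final class BackslashUnescaper {

    static String unescape(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c != '\\') {
                out.append(c); // ordinary character: copy it
                continue;
            }
            if (i + 1 == in.length()) {
                throw new IllegalArgumentException("dangling backslash at end of input");
            }
            char next = in.charAt(++i); // consume the character after the backslash
            switch (next) {
                case '\\':
                    out.append('\\'); // escaped backslash becomes one backslash
                    break;
                case 'n':
                    out.append('\n'); // backslash + n becomes a real new-line
                    break;
                default:
                    // Policy choice (an assumption): reject anything else.
                    throw new IllegalArgumentException("unknown escape: \\" + next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a\\\\nb"; // the five characters: a \ \ n b
        // Parsing yields: a, backslash, n, b
        System.out.println(unescape(input));
        // A plain replace yields: a, backslash, new-line, b
        System.out.println(input.replace("\\n", "\n"));
    }
}
```

The last two lines show the pitfall from the previous paragraph: the naive replace turns an escaped backslash followed by n into a backslash plus a new-line, whereas the parser keeps the backslash and the n.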
If you're trying to observe the contents of a variable in the debugger, it's possible that the debugger may detect the new-line character and convert it back to the two-character sequence \n. So, if you're stepping through your code trying to figure out what's in it, you may fall victim to the debugger's helpfulness. And in most situations, it really is being helpful. Programmers usually want to know exactly what characters are in a string; they're less concerned about how those characters will appear on the screen.

Just use String.replace():
newtext = text.replace("\\n", "\n");

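Use the "(char)(10)" code to generate the true ASCII value.
// replace() takes plain text, so "\\n" here is a literal backslash followed by n
newstr = oldstr.replace("\\n", String.valueOf((char) 10));
// -or-
// replaceAll() takes a regex, so the backslash has to be escaped once more
newstr = oldstr.replaceAll("\\\\n", "" + (char) 10);
//(been a while)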
