removing all "<...>" from a java String


I have a string and I'd like to remove all tags delimited by < and >.
For example, the string
<title>Java Code</title>
should become
Java Code
and
<pre><font size="7"><strong>Some text here
</strong></font><strong>
should become
Some text here
How can it be done using charAt(i)?
Thanks in advance.

How can it be done using charAt(i)?
Here is how:
public static void main(String[] args) {
String s = "<pre><font size=\"7\"><strong>Some text here\n\n</strong></font><strong>";
String o = "";
boolean append = true;
for (int i = 0; i < s.length(); i++) {
if (s.charAt(i) == '<')
append = false;
if (append)
o += s.charAt(i);
if (s.charAt(i) == '>')
append = true;
}
System.out.println(o);
}

It is quite simple to do this using regular expressions.
String src = "<title>Java Code</title>";
String dst = src.replaceAll("<.+?>", "");
System.out.println(dst);

Since you specifically want to use charAt(i), here is the algorithm:
1. Start traversing the string from the beginning.
2. If the character you encounter opens a tag (<), keep traversing until you find the closing >. Then check the next character; if it is another <, repeat this step.
3. If the next character is not <, output characters until you see another <.
4. Then repeat step 2.
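The steps above can be sketched roughly as follows. This is only an illustration, not a robust HTML parser: the names TagStripper and stripTags are made up for this example, and it assumes every < is eventually closed by a > (an unclosed < drops the rest of the input).

```java
class TagStripper {
    // Removes everything from each '<' up to and including the next '>'.
    // Assumption: tags are well formed; an unclosed '<' discards the remaining text.
    static String stripTags(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '<') {
                // Inside a tag: skip ahead until the closing '>' is reached.
                while (i < s.length() && s.charAt(i) != '>') {
                    i++;
                }
                i++; // step past the '>' itself
            } else {
                // Ordinary text between tags: keep it.
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<title>Java Code</title>")); // prints "Java Code"
    }
}
```

A StringBuilder is used here instead of repeated String concatenation so that the output is built without creating a new String for every kept character.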

With charAt, you could loop over all the characters in your string, removing everything from < until the next >. However, your string could contain non-ASCII UTF code points, which could break this approach.
I would go with a regex, something like
String someTextHere = "...";
String cleanedText = someTextHere.replaceAll( "<[^>]*?>", "" );
However, let me also point you to this question, which lists concerns with the regex approach.
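To make the difference between the two patterns suggested in these answers concrete, here is a small sketch. In Java, . does not match line terminators unless Pattern.DOTALL is set, so <.+?> leaves a tag that spans two lines untouched, while a negated class such as [^>] also matches newlines. The class name, method names and sample input are made up for illustration.

```java
class TagRegexDemo {
    // Pattern from the first regex answer: '.' does not match line terminators by default.
    static String stripWithDot(String src) {
        return src.replaceAll("<.+?>", "");
    }

    // Pattern from the second regex answer: [^>] matches any character except '>',
    // newlines included.
    static String stripWithNegatedClass(String src) {
        return src.replaceAll("<[^>]*?>", "");
    }

    public static void main(String[] args) {
        // Hypothetical input: the opening tag is split across two lines.
        String src = "<font\n size=\"7\">Some text here</font>";
        System.out.println(stripWithDot(src));          // the first tag survives
        System.out.println(stripWithNegatedClass(src)); // prints "Some text here"
    }
}
```

Neither pattern understands comments, scripts, or a > inside an attribute value, which is presumably the kind of concern the linked question raises.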

Related

How to - Loop through a String and identify a specific character and add to the string

I'm currently trying to loop through a String, identify a specific character within that string, and then add a specific character right after the one identified.
For example using the string: aaaabbbcbbcbb
And the character I want to identify being: c
So every time a c is detected a following c will be added to the string and the loop will continue.
Thus aaaabbbcbbcbb will become aaaabbbccbbccbb.
I've been trying to make use of indexOf(), substring() and charAt(), but I'm currently either overwriting other characters with a c or only detecting one c.

I know you've asked for a loop, but won't something as simple as a replace suffice?
String inputString = "aaaabbbcbbcbb";
String charToDouble = "c";
String result = inputString.replace(charToDouble, charToDouble+charToDouble);
// or `charToDouble+charToDouble` could be `charToDouble.repeat(2)` in JDK 11+
If you insist on using a loop however:
String inputString = "aaaabbbcbbcbb";
char charToDouble = 'c';
String result = "";
for(char c : inputString.toCharArray()){
result += c;
if(c == charToDouble){
result += c;
}
}

Iterate over all the characters. Add each one to a StringBuilder. If it matches the character you're looking for then add it again.
final String test = "aaaabbbcbbcbb";
final char searchChar = 'c';
final StringBuilder builder = new StringBuilder();
for (final char c : test.toCharArray())
{
builder.append(c);
if (c == searchChar)
{
builder.append(c);
}
}
System.out.println(builder.toString());
Output
aaaabbbccbbccbb

You are probably trying to modify a String in Java. Strings in Java are immutable and cannot be changed in place the way one might in C++.
You can use StringBuilder to insert characters, e.g.:
StringBuilder builder = new StringBuilder("acb");
builder.insert(1, 'c');
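Applied to the original problem, the insert-based idea might look like the sketch below. The names InsertDemo and doubleChar are hypothetical. The point to watch is that the index has to skip over the character that was just inserted; otherwise the loop finds it again and never terminates.

```java
class InsertDemo {
    // Doubles every occurrence of 'target' by inserting into the builder in place.
    static String doubleChar(String input, char target) {
        StringBuilder builder = new StringBuilder(input);
        for (int i = 0; i < builder.length(); i++) {
            if (builder.charAt(i) == target) {
                builder.insert(i + 1, target);
                i++; // skip the character just inserted
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleChar("aaaabbbcbbcbb", 'c')); // prints "aaaabbbccbbccbb"
    }
}
```

Each insert shifts the rest of the buffer, so for long inputs with many matches, appending to a fresh StringBuilder is likely the cheaper choice.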

The previous answer suggesting String.replace is the best solution, but if you need to do it some other way (e.g. for an exercise), then here's a 'modern' solution:
import java.util.stream.IntStream; // needed for IntStream.of below
public static void main(String[] args) {
final String inputString = "aaaabbbcbbcbb";
final int charToDouble = 'c'; // A Unicode codepoint
final String result = inputString.codePoints()
.flatMap(c -> c == charToDouble ? IntStream.of(c, c) : IntStream.of(c))
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
.toString();
assert result.equals("aaaabbbccbbccbb");
}
This looks at each character in turn (in an IntStream). It doubles the character if it matches the target. It then accumulates each character in a StringBuilder.
A micro-optimization can be made to pre-allocate the StringBuilder's capacity. We know the maximum possible size of the new string is double the old string, so StringBuilder::new can be replaced by () -> new StringBuilder(inputString.length()*2). However, I'm not sure if it's worth the sacrifice in readability.
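For reference, the pre-allocating variant described above could be written as in this sketch. The class and method names are made up; only the supplier passed to collect differs from the stream version in this answer.

```java
import java.util.stream.IntStream;

class PreallocatedDoubling {
    // Doubles every occurrence of the code point 'target' in 'input'.
    static String doubleCodePoint(String input, int target) {
        return input.codePoints()
                .flatMap(c -> c == target ? IntStream.of(c, c) : IntStream.of(c))
                // Worst case: every character is doubled, so reserve twice the length.
                .collect(() -> new StringBuilder(input.length() * 2),
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleCodePoint("aaaabbbcbbcbb", 'c')); // prints "aaaabbbccbbccbb"
    }
}
```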

Java efficiently replace unless matches complex regular expression

I have over a gigabyte of text that I need to go through and surround punctuation with spaces (tokenizing). I have a long regular expression (1818 characters, though that's mostly lists) that defines when punctuation should not be separated. Being long and complicated makes it hard to use groups with it, though I wouldn't leave that out as an option since I could make most groups non-capturing (?:).
Question: How can I efficiently replace certain characters that don't match a particular regular expression?
I've looked into using lookaheads or similar, and I haven't quite figured it out, but it seems to be terribly inefficient anyway. It would likely be better than using placeholders though.
I can't seem to find a good "replace with a bunch of different regular expressions for both finding and replacing in one pass" function.
Should I do this line by line instead of operating on the whole text?
String completeRegex = "[^\\w](("+protectedPrefixes+")|(("+protectedNumericOnly+")\\s*\\p{N}))|"+protectedRegex;
Matcher protectedM = Pattern.compile(completeRegex).matcher(s);
ArrayList<String> protectedStrs = new ArrayList<String>();
//Take note of the protected matches.
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
}
//Replace protected matches.
String replaceStr = "<PROTECTED>";
s = protectedM.replaceAll(replaceStr);
//Now that it's safe, separate punctuation.
s = s.replaceAll("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])"," $1 ");
// These are for apostrophes. Can these be combined with either the protecting regular expression or the one above?
s = s.replaceAll("([\\p{N}\\p{L}])'(\\p{L})", "$1 '$2");
s = s.replaceAll("([^\\p{L}])'([^\\p{L}])", "$1 ' $2");
Note the two additional replacements for apostrophes. Using placeholders protects against those replacements as well, but I'm not really concerned with apostrophes or single quotes in my protecting regex anyway, so it's not a real concern.
I'm rewriting what I considered very inefficient Perl code with my own in Java, keeping track of speed, and things were going fine until I started replacing the placeholders with the original strings. With that addition it's too slow to be reasonable (I've never seen it get even close to finishing).
//Replace placeholders with original text.
String resultStr = "";
String currentStr = "";
int currentPos = 0;
int[] protectedArray = replaceStr.codePoints().toArray();
int protectedLen = protectedArray.length;
int[] strArray = s.codePoints().toArray();
int protectedCount = 0;
for (int i=0; i<strArray.length; i++) {
int pt = strArray[i];
// System.out.println("pt: "+pt+" symbol: "+String.valueOf(Character.toChars(pt)));
if (protectedArray[currentPos]==pt) {
if (currentPos == protectedLen - 1) {
resultStr += protectedStrs.get(protectedCount);
protectedCount++;
currentPos = 0;
} else {
currentPos++;
}
} else {
if (currentPos > 0) {
resultStr += replaceStr.substring(0, currentPos);
currentPos = 0;
currentStr = "";
}
resultStr += ParseUtils.getSymbol(pt);
}
}
s = resultStr;
This code may not be the most efficient way to return the protected matches. What is a better way? Or better yet, how can I replace punctuation without having to use placeholders?

I don't know exactly how big your in-between strings are, but I suspect that you can do somewhat better than using Matcher.replaceAll, speed-wise.
You're making 3 passes across the string, each time creating a new Matcher instance and then a new String. And because you're using + to concatenate, each step creates one string for the in-between text plus the protected group, and then another when that is concatenated to the current result. You don't really need all of these extra instances.
Firstly, you should accumulate the resultStr in a StringBuilder, rather than via direct string concatenation. Then you can proceed something like:
StringBuilder resultStr = new StringBuilder();
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
appendInBetween(resultStr, str, currIndex, protectedM.start());
resultStr.append(protectedM.group());
currIndex = protectedM.end();
}
// The text after the last protected match needs the same treatment.
appendInBetween(resultStr, str, currIndex, str.length());
where appendInBetween is a method implementing the equivalent to the replacements, just in a single pass:
void appendInBetween(StringBuilder resultStr, String s, int start, int end) {
// Pass the whole input string and the bounds, rather than taking a substring.
// Allocate roughly enough space up-front.
resultStr.ensureCapacity(resultStr.length() + end - start);
for (int i = start; i < end; ++i) {
char c = s.charAt(i);
// Check if c matches "([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])".
if (!(Character.isLetter(c)
|| Character.isDigit(c)
|| Character.getType(c) == Character.NON_SPACING_MARK
|| "_-<>'".indexOf(c) != -1)) {
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else if (c == '\'' && i > 0 && i + 1 < s.length()) {
// We have a quote that's not at the beginning or end.
// Call these 3 characters bcd, where c is the quote.
char b = s.charAt(i - 1);
char d = s.charAt(i + 1);
if ((Character.isDigit(b) || Character.isLetter(b)) && Character.isLetter(d)) {
// If the 3 chars match "([\\p{N}\\p{L}])'(\\p{L})"
resultStr.append(' ');
resultStr.append(c);
} else if (!Character.isLetter(b) && !Character.isLetter(d)) {
// If the 3 chars match "([^\\p{L}])'([^\\p{L}])"
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else {
resultStr.append(c);
}
} else {
// Everything else, just append.
resultStr.append(c);
}
}
}
Obviously, there is a maintenance cost associated with this code - it is undeniably more verbose. But the advantage of doing it explicitly like this (aside from the fact it is just a single pass) is that you can debug the code like any other - rather than it just being the black box that regexes are.
I'd be interested to know if this works any faster for you!

At first I thought that appendReplacement wasn't what I was looking for, but indeed it was. Since it's replacing the placeholders at the end that slowed things down, all I really needed was a way to dynamically replace matches:
StringBuffer replacedBuff = new StringBuffer();
Matcher replaceM = Pattern.compile(replaceStr).matcher(s);
int index = 0;
while (replaceM.find()) {
replaceM.appendReplacement(replacedBuff, "");
replacedBuff.append(protectedStrs.get(index));
index++;
}
replaceM.appendTail(replacedBuff);
s = replacedBuff.toString();
Reference: Second answer at this question.
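Put together as one self-contained piece, the protect, tokenize, restore sequence might look like the sketch below. The PROTECTED pattern here is a hypothetical stand-in (decimal numbers only) for the long protecting regex from the question, and the class and method names are made up. It also passes the protected text through Matcher.quoteReplacement instead of appending it separately, which is a small variation on the code above with the same effect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProtectAndRestore {
    // Hypothetical stand-in for the real protecting regex: decimal numbers only.
    private static final Pattern PROTECTED = Pattern.compile("\\p{N}+\\.\\p{N}+");
    private static final String PLACEHOLDER = "<PROTECTED>";
    private static final Pattern PLACEHOLDER_PATTERN =
            Pattern.compile(Pattern.quote(PLACEHOLDER));
    // Punctuation class taken from the question.
    private static final Pattern PUNCTUATION =
            Pattern.compile("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])");

    static String tokenize(String s) {
        // 1. Record the protected matches, then swap them for placeholders.
        List<String> protectedStrs = new ArrayList<>();
        Matcher protectedM = PROTECTED.matcher(s);
        while (protectedM.find()) {
            protectedStrs.add(protectedM.group());
        }
        s = protectedM.replaceAll(PLACEHOLDER); // replaceAll resets the matcher first

        // 2. Now that it's safe, separate punctuation.
        s = PUNCTUATION.matcher(s).replaceAll(" $1 ");

        // 3. Restore the protected text in a single pass.
        StringBuffer out = new StringBuffer();
        Matcher replaceM = PLACEHOLDER_PATTERN.matcher(s);
        int index = 0;
        while (replaceM.find()) {
            // quoteReplacement keeps any '$' or '\' in the protected text literal.
            replaceM.appendReplacement(out,
                    Matcher.quoteReplacement(protectedStrs.get(index)));
            index++;
        }
        replaceM.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x=1.5;")); // prints "x = 1.5 ; "
    }
}
```

Like the original, this assumes the input never contains the literal text <PROTECTED> on its own; if it might, a different placeholder would be needed.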
Another option to consider:
During the first pass through the String, to find the protected Strings, take the start and end indices of each match, replace the punctuation for everything outside of the match, add the matched String, and then keep going. This takes away the need to write a String with placeholders, and requires only one pass through the entire String. It does, however, require many separate small replacement operations. (By the way, be sure to compile the patterns before the loop, as opposed to using String.replaceAll()). A similar alternative is to add the unprotected substrings together, and then replace them all at the same time. However, the protected strings would then have to be added to the replaced string at the end, so I doubt this would save time.
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
String substr = s.substring(currIndex,protectedM.start());
substr = p1.matcher(substr).replaceAll(" $1 ");
substr = p2.matcher(substr).replaceAll("$1 '$2");
substr = p3.matcher(substr).replaceAll("$1 ' $2");
resultStr += substr+protectedM.group();
currIndex = protectedM.end();
}
Speed comparison for 100,000 lines of text:
Original Perl script: 272.960579875 seconds
My first attempt: Too long to finish.
With appendReplacement(): 14.245160866 seconds
Replacing while finding protected: 68.691842962 seconds
Thank you, Java, for not letting me down.

Java - Delete each 4th occurrence of a character in a row

I'm searching for a way to delete each 4th occurrence of a character (a-zA-Z) in a row.
For example, if I have the following string:
helloooo I am veeeeeeeeery busy right nowww because I am working veeeeeery hard
I want to delete all 4th, 5th, 6th, ... characters in a row. But in the word hard, a 4th r occurs, which I do NOT want to delete, because it is not the 4th r in a row: it is surrounded by other characters. The result should be:
hellooo I am veeery busy right nowww because I am working veeery hard
I have already searched for a way to do this. I did find a way to replace/delete the 4th occurrence of a character, but I could not find a way to replace/delete the 4th occurrence of a character in a row.
Thanks in advance.

The function may be written like this:
public static String transform(String input) {
if (input.isEmpty()) {
return input;
} else {
final StringBuilder sb = new StringBuilder();
char lastChar = '\0';
int duplicates = 0;
for (int i = 0; i < input.length(); i++) {
final char curChar = input.charAt(i);
if (curChar == lastChar) {
duplicates++;
if (duplicates < 3) {
sb.append(curChar);
}
} else {
sb.append(curChar);
lastChar = curChar;
duplicates = 0;
}
}
return sb.toString();
}
}
I think it's faster than regex.

In Java you can use this replacement based on back-references:
str = str.replaceAll("(([a-zA-Z])\\2\\2)\\2+", "$1");

The regex you want is ((.)\2{2})\2*. Not quite sure what that is in Java-ese, but what it does is match any single character and then 2 additional instances of that character, followed by any number of additional instances. Then replace it with the contents of the first capture group (\1) and you're good to go.
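In Java, that regex could be used roughly as shown here. The backslashes are doubled inside the string literal, and the first capture group is written as $1 in the replacement. The class and method names are made up for this sketch. Note that . matches any character, not just a-zA-Z, so runs of spaces or digits are shortened as well.

```java
class RepeatTrimmer {
    // Keeps at most three consecutive copies of any character.
    static String limitRuns(String input) {
        // Group 2 is a single character; group 1 is that character three times.
        // Any further repeats are matched by \2* and dropped by the replacement.
        return input.replaceAll("((.)\\2{2})\\2*", "$1");
    }

    public static void main(String[] args) {
        String s = "helloooo I am veeeeeeeeery busy right nowww because I am working veeeeeery hard";
        System.out.println(limitRuns(s));
        // prints "hellooo I am veeery busy right nowww because I am working veeery hard"
    }
}
```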

How do I replace a unicode Character representing an emoji into a colon delimited String emoji?

I've got a JSON mapping all of the unicode emojis to a colon separated string representation of them (like twitter uses). I've imported the file into an ArrayList of Pair< Character, String> and now need to scan a String message and replace any unicode emojis with their string equivalents.
My code for conversion is the following:
public static String getStringFromUnicode(Context context, String m) {
ArrayList<Pair<Character, String>> list = loadEmojis(context);
String formattedString="";
for (Pair p : list) {
formattedString = message.replaceAll(String.valueOf(p.first), ":" + p.second + ":");
}
return formattedString;
}
but I always get the unicode emoji representation when I send the message to a server.
Any help would be greatly appreciated, thanks!!

When in doubt go back to first principles.
You have a lot of stuff that is all nested together. I have found in such cases that your best approach to solving the problem is to pull it apart and look at what the different pieces are doing. This lets you take control of the problem, and place test code where needed to see what the data is doing.
My best guess is that replaceAll() is acting unpredictably, misinterpreting the emoji string as commands for its regular expression analysis.
I would suggest substituting replaceAll() with a loop of your own that does the same thing. Since we are working with Unicode, I would suggest going deep on this one. This little code sample does the same thing as replaceAll(), but because it addresses the string character by character, it should work no matter what odd control codes are in the string.
String message = "This :-) is a test :-) message";
String find = ":-)";
String replace = "!";
int pos = 0;
//Replicates function of replaceAll without the regular expression analysis
pos = subPos(message,find);
while (pos != -1)
{
String tmp = message.substring(0,pos);
tmp = tmp + replace;
tmp = tmp + message.substring(pos+find.length());
message = tmp;
pos = subPos(message,find);
}
System.out.println(message);
-- Snip --
//Replicates function of indexOf
public static int subPos(String str, String sub)
{
for (int i = 0; i < str.length() - (sub.length() - 1); i++)
{
int j;
for (j = 0; j < sub.length(); j++)
{
if (str.charAt(i + j) != sub.charAt(j))
break;
}
if (j == sub.length())
return i;
}
return -1;
}
I hope this helps. :-)
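As a more compact alternative to a hand-written loop, String.replace(CharSequence, CharSequence) already performs a literal, non-regex replacement. The sketch below also addresses two things visible in the question's code: each iteration there starts again from the original message, so only the last replacement can survive, and a Character cannot hold an emoji that lies outside the Basic Multilingual Plane. The class name, method name, map contents and emoji names are assumptions made for illustration, not the asker's actual data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EmojiNames {
    // Replaces each emoji key with ":name:". Keys are Strings rather than chars
    // because many emoji need two UTF-16 code units.
    static String toColonNames(String message, Map<String, String> emojiToName) {
        String result = message;
        for (Map.Entry<String, String> entry : emojiToName.entrySet()) {
            // String.replace treats both arguments literally (no regex), and
            // working on 'result' keeps the replacements made so far.
            result = result.replace(entry.getKey(), ":" + entry.getValue() + ":");
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> emojiToName = new LinkedHashMap<>();
        emojiToName.put("\uD83D\uDE00", "grinning"); // U+1F600, a surrogate pair
        emojiToName.put("\u2764", "heart");          // U+2764, a single code unit
        System.out.println(toColonNames("Hi \uD83D\uDE00 \u2764", emojiToName));
        // prints "Hi :grinning: :heart:"
    }
}
```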

How to tokenize a string using indexOf and substring methods

So I have to tokenize a string, and I can only use these 2 methods to do it.
I have the base, but I don't know what to put in.
My friend did it, but I forgot how it looked; it went something like this.
I remember he split it using the length of a tab.
public class Tester
{
private static StringBuffer sb = new StringBuffer ("The cat in the hat");
public static void main(String[] args)
{
for(int i = 0; i < sb.length() ; i++)
{
int tempIndex = sb.indexOf(" ", 0);
sb.substring(0,tempIndex);
if(tempIndex > 0)
{
System.out.println(sb.substring(0,tempIndex));
sb.delete(0, sb.length());
}
}
}
}

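StringBuffer.indexOf(String str) returns the index of the first occurrence of the given string. If you do sb.indexOf(" ") you'll get the index of the first space. You can use that in conjunction with substring(): sb.substring(0, sb.indexOf(" ")) will get you your first token. The end index of substring() is exclusive, so there is no need to subtract 1.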
This seems like a homework problem, so I don't want to give you the full answer, but you probably can work it out. Comment if you need more help.

If you are familiar with the while loop construct, you can take a look at my pseudocode, which should be within the constraints of your problem:
String text = "texty text text"
while(textHasASpace){
print text up to space
set text to equal all text AFTER the space
}
print ??
Using your two allowed methods the above is convertible line by line to what you are after.
Hope it helps.
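One possible line-by-line translation of that pseudocode is sketched below, using only indexOf and substring for the splitting itself. The class and method names are made up, and the tokens are collected into a list purely so the result is easy to inspect; printing them directly would work just as well.

```java
import java.util.ArrayList;
import java.util.List;

class Tokenizer {
    // Splits 'text' on single spaces using only indexOf and substring.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int space = text.indexOf(" ");
        while (space != -1) {                  // while the text still has a space
            tokens.add(text.substring(0, space));  // the text up to the space
            text = text.substring(space + 1);      // everything after the space
            space = text.indexOf(" ");
        }
        tokens.add(text); // whatever remains after the last space
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("The cat in the hat")) {
            System.out.println(token);
        }
    }
}
```

Consecutive spaces produce empty tokens with this approach; whether that matters depends on the assignment.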
