Remove ASCII symbol from String [closed] - java

I need to get rid of a character that looks exactly like the male ascii symbol from text - ♂. However, it's not the standard ASCII symbol, because if I paste it on StackExchange, it's displayed as indicated below:
How can I replace the character within a String? I've tried pasting the character directly into Eclipse but unfortunately that doesn't work (it looks exactly like the image above when pasted into Eclipse). You can see the symbol in Notepad++ however when using the search function:
However, when displayed inline, it looks like this:
Edit: @Greg-449's answer: I've tried that, but the character still remains in the String. I don't think it's the default character. I'll show you where you can reference it from a website:
Thermaltake: Chassis > Versa > Versa H21
If you highlight the specifications and choose View selection source, you'll notice it starts appearing on line 63 after the word (optional).
How can I remove this symbol from the String? If at all possible, is there a way to exclude strange symbols like that in general?
Edit 2. After trying both suggested answers, I'm still not able to remove it from the String. A critical part I now see that I may have left out is that the text is copied from the website, into Microsoft Excel, then into a Java Applet (TextArea) where it is analyzed & manipulated from. Even though not visible in the text area, it still remains there when copied back into Excel after being manipulated.
Code tested is:
String descript = textArea.getText();
descript = descript.replace('\u000B', ' ');
textArea.setText(descript);
When taking this text back into Excel, the character remains.

This is a Unicode symbol so to paste it directly you need to be editing a file with a suitable encoding such as UTF-8 and you need to be using a font that can display the symbol.
In a Java string you can always use the Unicode escape to represent the character. The male symbol is Unicode U+2642 so the string would be:
"\u2642"
Update: Looking at the web site you reference, the character is actually a 'vertical tab (VT)' character, Unicode U+000B, which explains the 'VT' you see when it is displayed inline. You can use
"\u000B"
for this.
Use something like
String newString = oldString.replace('\u000B', ' ');
to get a new string with the VTs replaced by blanks.
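For the more general request in the question (getting rid of "strange" symbols), here is a minimal sketch. It assumes that "strange" means ASCII control characters other than tab, line feed and carriage return; the class and method names are made up for illustration. The toCodePoints helper is only there to reveal what an invisible character really is before deciding how to replace it.

```java
import java.util.stream.Collectors;

final class ControlCharCleaner {
    // Replaces every control character except tab, LF and CR with a space.
    static String replaceControlChars(String text) {
        return text.replaceAll("[\\p{Cntrl}&&[^\\t\\n\\r]]", " ");
    }

    // Lists the code points of a string, e.g. to identify an invisible character.
    static String toCodePoints(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Hypothetical sample text with a vertical tab after "(optional)".
        String raw = "(optional)\u000B120mm fan";
        System.out.println(toCodePoints("a\u000Bb"));
        System.out.println(replaceControlChars(raw));
    }
}
```

If the character survives a replacement like this, printing the code points of the text that actually reaches the TextArea is a cheap way to check whether it really is U+000B or something else.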

The VT ("vertical tab") character is actually ASCII character 11, or 0x0B. So it appears that this character is just displayed in a non-standard (neither ASCII nor Unicode) way by some tools.
Knowing that you're looking for the ASCII code 11, you could do char maleChar = (char)11; or String maleStr = "" + ((char)11); and then do your replacement operations based on that.
If, on the other hand, the data you have in your string is actually binary data read, for example, from a stream, you'd probably be better off using a byte[] or int[] array in the first place.

Remove White Spaces between Specific Substring in a String [duplicate]

What I want is for all the spaces inside the <abc> tag to be removed, while keeping the spaces inside the <efg> tag:
<abc>this is between abc</abc><efg>this is between efg</efg>
<efg>this is between efg</efg><abc>this is between abc</abc>
The output I want:
<abc>thisisbetweenabc</abc><efg>this is between efg</efg>
<efg>this is between efg</efg><abc>thisisbetweenabc</abc>
I tried string = string.replaceAll("<abc> </abc>", ""); but it's not working for me.
Brief
I urge you to use an XML parser!!! Anyway, if it's a limited, known set of HTML, you can use the following regex (as per my original comment).
Note: This solution only works on a limited, known set of HTML. If your input differs from what you posted in your question, it is likely this solution will not work. See Pshemo's comment below your question.
Note 2: The OP changed the format of the input, thus my original answer will no longer work. See original input below. (Exactly why I put a limited, known set of HTML). In the Code section I've added a second regex that works on the OP's newly added input.
Code
See regex in use here
(?:^(<abc>)|\G(?!^))(\S+)[ \t]*
Replace with $1$2
With the new input format, the following regex can be used (as seen in use here):
(?:^(<abc>)|\G(?!^))([^\s<]+)[ \t]*
Results
Input
<abc>this is between abc</abc>
<efg>this is between efg</efg>
<abc>this is between abc</abc>
<efg>this is between efg</efg>
Output
<abc>thisisbetweenabc</abc>
<efg>this is between efg</efg>
<abc>thisisbetweenabc</abc>
<efg>this is between efg</efg>
Explanation
(?:^(<abc>)|\G(?!^)) Match either of the following:
  ^(<abc>) Match the following:
    ^ Assert position at the start of the line
    (<abc>) Capture <abc> literally into capture group 1
  \G(?!^) Assert position at the end of the previous match, but not at the start of a line
(\S+) Capture one or more non-whitespace characters into capture group 2
[ \t]* Match space or tab characters any number of times
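Since the question is about Java, here is a minimal sketch of how the second regex could be applied there. The class and method names are made up for illustration; Pattern.MULTILINE is needed so that ^ matches at the start of every line rather than only at the start of the input.

```java
import java.util.regex.Pattern;

final class AbcSpaceRegex {
    // Second regex from the answer; MULTILINE makes ^ match at every line start.
    private static final Pattern ABC_WORD =
            Pattern.compile("(?:^(<abc>)|\\G(?!^))([^\\s<]+)[ \\t]*", Pattern.MULTILINE);

    static String stripSpaces(String input) {
        // Group 1 only participates in the first match of a line;
        // on later matches $1 contributes nothing to the replacement.
        return ABC_WORD.matcher(input).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(
                stripSpaces("<abc>this is between abc</abc><efg>this is between efg</efg>"));
    }
}
```

Note that this only touches an <abc> element that opens at the start of a line. A line such as <efg>...</efg><abc>...</abc> is left unchanged, which is one more reason the answer limits itself to a known input format.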
Simply do this:
String xml = "<abc>this is between abc</abc>"; // the overall string with <abc> and </abc>
int start = xml.indexOf("<abc>");
int end = xml.indexOf("</abc>");
// substring takes a start index and an end index, not a length
String abcOnly = xml.substring(start, end);
abcOnly = abcOnly.replace(" ", "");
This is only a rough sketch, but you can easily adapt it. You may also have to tweak the indexes by plus or minus a few characters; I am not in front of your code to test it, but you should be able to get what you need from this.
Disclaimer: Using an XML parser is a far better way to handle this than manipulating strings, but I'll assume you have your reasons, so I'll answer the question you asked instead of telling you to go get an XML parser, lol. Good luck.
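The index-based idea can be extended to every <abc> element in the string, wherever it appears on a line. The following is a sketch under stated assumptions: the elements are not nested, each opening tag has no attributes, and the class and method names are invented for illustration.

```java
final class TagSpaceStripper {
    // Removes spaces inside every <tag>...</tag> pair and leaves the rest untouched.
    static String stripSpacesInside(String text, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int start = text.indexOf(open, pos);
            if (start < 0) break;
            int contentStart = start + open.length();
            int end = text.indexOf(close, contentStart);
            if (end < 0) break; // unclosed tag: copy the remainder unchanged
            out.append(text, pos, contentStart);
            out.append(text.substring(contentStart, end).replace(" ", ""));
            pos = end;
        }
        out.append(text.substring(pos));
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "<efg>this is between efg</efg><abc>this is between abc</abc>";
        System.out.println(stripSpacesInside(line, "abc"));
    }
}
```

Unlike the line-anchored regex, this also handles an <abc> element that follows an <efg> element on the same line, but it is still plain string manipulation and not a substitute for a parser.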

? is the only output for all unicode above U+0080 in java [duplicate]

While printing certain unicode characters in java we get output as '?'. Why is it so and is there any way to print these characters?
This is my code
String symbol1="\u200d";
StringBuilder strg = new StringBuilder("unicodecharacter");
strg.insert(5,symbol1);
System.out.println("After insertion...");
System.out.println(strg.toString());
Output is
After insertion...
unico?decharacter
Here's a great article, written by Joel Spolsky, on the topic. It won't directly help you solve your problem, but it will help you understand what's going on. It'll also show you how involved the situation really is.
You have a character encoding which doesn't match the character you have or the supported characters on the screen.
I would check which encoding you are using throughout and try to determine whether you are reading, storing or printing the value correctly.
Are you sure which encoding you need? You may need to explicitly encode your output as UTF-8 or ISO 8859-1 if you are dealing with European characters.
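The advice about explicitly encoding the output can be sketched as follows. The class and helper names are hypothetical. The first helper shows where the '?' comes from: a character that the target charset cannot represent is replaced during encoding. The second wraps a stream in a PrintStream that always writes UTF-8. Whether the result is displayed correctly still depends on the console's own encoding and font, and U+200D in particular is a zero-width joiner, so it stays invisible even when encoded correctly.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8Console {
    // Bytes produced when text is encoded with the given charset; characters
    // the charset cannot represent become its replacement byte (usually '?').
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Wraps a stream so that everything printed through it is encoded as UTF-8.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 must be available on every JVM", e);
        }
    }

    public static void main(String[] args) {
        String text = new StringBuilder("unicodecharacter").insert(5, "\u200d").toString();
        // One byte in US-ASCII (the replacement), three bytes in UTF-8.
        System.out.println(encode("\u200d", StandardCharsets.US_ASCII).length);
        System.out.println(encode("\u200d", StandardCharsets.UTF_8).length);
        utf8(System.out).println(text);
    }
}
```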
Java's default behaviour when reading an invalid unicode character is to replace it with the Replacement Character (\uFFFD). This character is often rendered as a question mark.
In your case, the text you're reading is not encoded as unicode, it's encoded as something else (Windows-1252 or ISO-8859-1 are probably the most common alternatives if your text is in English).
I wrote an open source library that has a utility that converts any String to a Unicode sequence and vice versa. It helps to diagnose such issues. So, for instance, to print your String you can use something like this:
String str= StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0197" +
StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Test"));
You can read about the library, where to download it and how to use it in the article "Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison". See the paragraph "String Unicode converter".

Encoding a string in 128c barcode symbology

I am having some trouble with encoding this string into barcode symbology - Code 128.
Text to encode:
1021448642241082212700794828592311
I am using the universal encoder from idautomation.com:
https://www.bcgen.com/fontencoder/
I get the following output for the encoded text for Code 128:
Í*5LvJ8*r5;ÂoP<[7+.Î
However, in ";Âo" the character between the semi-colon and o (let us call it special A) - is not part of the extended character set used in Code128. (See the Latin Supplements at https://www.fonts2u.com/code-128.font)
Yet the same string shows a valid barcode at
https://www.bcgen.com/linear-barcode-creator.html
How?
If I use the output with the Special A on a webpage with a font face for barcodes, the special A character does not show up as the barcode (and that seems correct since the special A is not part of the character set).
What gives? Please help.
I am using the IDAutomation utility to encode the string to 128c symbology. If you can share code to do the encoding (in Java/Python/C/Perl) that would help too.
There are multiple fonts for Code128 that may use different characters to represent the barcode symbols. Make sure the font and the encoding logic match each other.
I used this one http://www.jtbarton.com/Barcodes/Code128.aspx (there is also sample code how to encode it on the site, but you have to translate it from VB). The font works for all three encodings (A, B and C).
Sorry, this is very late.
When you are dealing with the encoding of code 128, in any subset, it's a good idea to think of that coding in terms of numbers, not characters. At this level, when you have shifts, code-changes, checksums and stuff, intermixed with the data, the whole concept of "character" is lost.
However, this is what is happening:
The semicolon in the output corresponds to "27"
The lowercase o corresponds to "79" and the P to "48"
The "A with circumflex" (Â) corresponds to your "00" sequence. This is why you should be dealing with numbers, not characters, at this level of encoding.
How would you expect it to show a character with a code of 00? That would be a space or NUL, neither of which is particularly visible.
Your software has simply rendered it the best way it can, which is to make the character 'visible' by substituting one from the extended range: the Â in your output is character 0xC2 (194).
The rest (indeed all) of your encoded string looks correct for a Code Set C encoding.
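To make the "think in numbers" point concrete, here is a minimal Code Set C sketch. The value splitting and the modulo-103 check value follow the standard Code 128 rules. The toFontChar mapping, however, is an assumption inferred purely from the encoder output shown in the question (value 0 as character 194, values 1 to 94 as value + 32, values 95 to 106 as value + 100); other Code 128 fonts may use a different mapping, so check it against the font you actually use. The class name is made up.

```java
final class Code128C {
    private static final int START_C = 105;
    private static final int STOP = 106;

    // Splits an even-length digit string into Code Set C values (one per digit pair).
    static int[] toValues(String digits) {
        if (digits.length() % 2 != 0 || !digits.matches("[0-9]*")) {
            throw new IllegalArgumentException("Code Set C needs an even number of digits");
        }
        int[] values = new int[digits.length() / 2];
        for (int i = 0; i < values.length; i++) {
            values[i] = Integer.parseInt(digits.substring(2 * i, 2 * i + 2));
        }
        return values;
    }

    // Code 128 check value: (start value + sum of position * value) mod 103.
    static int checksum(int[] values) {
        int sum = START_C;
        for (int i = 0; i < values.length; i++) {
            sum += (i + 1) * values[i];
        }
        return sum % 103;
    }

    // ASSUMPTION: font mapping inferred from the output in the question.
    static char toFontChar(int value) {
        if (value == 0) return (char) 194;
        if (value <= 94) return (char) (value + 32);
        return (char) (value + 100);
    }

    static String encode(String digits) {
        int[] values = toValues(digits);
        StringBuilder out = new StringBuilder().append(toFontChar(START_C));
        for (int v : values) {
            out.append(toFontChar(v));
        }
        return out.append(toFontChar(checksum(values))).append(toFontChar(STOP)).toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("1021448642241082212700794828592311"));
    }
}
```

For the string in the question, the pairs are 10 21 44 86 42 24 10 82 21 27 00 79 48 28 59 23 11 and the check value works out to 14, which under the assumed mapping is the '.' just before the stop character.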

Adding Annotation with Hebrew letters in Itext

When I add an annotation using:
PdfAnnotation.createFileAttachment(writer,null,null , null, , "שם קובץ", "שם קובץ");
the Hebrew letters in the annotation are not shown.
Is there a way to fix it?
You're using Hebrew characters in your code. That's not safe. Please replace them with a unicode notation (you'll need to know their unicode value; for instance \u00a0 is the value for a non-breaking space). If you don't do this, compilers could interpret the characters incorrectly (see the remarks that were given).
It appears to me that you don't have the correct number of parameters in the method. I assume that you're using this method.
You're using a 'short-cut' method that assumes that the characters aren't Unicode characters. Please don't. Use the method where you create a PdfFileSpecification object, and use methods such as setUnicodeFileName() with the unicode parameter set to true. This way, iText knows that the characters should be interpreted as Unicode characters.
You probably want the characters to appear from right to left. I don't know if this is supported in PDF. I browsed ISO-32000-1 and looked at Table 44 (Entries in a file specification dictionary), but all I saw was: Unicode text string that provides file specification of the form described in 7.11.2, "File Specification Strings." This is a text string encoded using PDFDocEncoding or UTF-16BE with a leading byte-order marker (as defined in 7.9.2.2, "Text String Type"). You'll have to dig into those sections if you want to know more.
You pass null as the value for the Rectangle. That doesn't make sense. Are you sure you want to add a file attachment annotation? Based on your code I would assume that you want to add a document-level attachment instead. That's done like this: writer.addFileAttachment(fs); with fs an instance of the PdfFileSpecification class.
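For the first point (replacing literal Hebrew characters with Unicode notation), a small helper can generate the escapes so that they don't have to be looked up by hand. This is a sketch using only the standard library; the class and method names are made up, and it has nothing iText-specific in it.

```java
final class UnicodeEscaper {
    // Rewrites every non-ASCII UTF-16 unit as a Java-style unicode escape
    // and leaves plain ASCII characters as they are.
    static String toUnicodeEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The file name from the question, written with escapes.
        System.out.println(toUnicodeEscapes("\u05e9\u05dd \u05e7\u05d5\u05d1\u05e5"));
    }
}
```

The printed text can then be pasted into the source file as a string literal in place of the Hebrew characters.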