Cleaning an inputstring containing binary junk to produce an ascii printable string


In our application we have a text field that is controlled by TinyMCE. If the customer pastes text from Word into the text field, Oracle balks when we try to store this text in our database:
ORA-01461: can bind a LONG value only for insert into a LONG column
Cleaning the text in, say, Notepad first avoids the problem, so my guess is that the input string contains some kind of binary junk that Oracle treats as a delimiter between the values used in the SQL insert statement.
Upgrading our ancient TinyMCE will probably fix the problem, but I also want to ensure the text really is clean when it is passed to the lower layers. So I thought I might check that the text is true ASCII and, if not, strip everything that does not pass as ASCII by looping through the lines of the input and doing the following:
line.replaceAll("[^\\p{ASCII}]", "")
Is this a viable solution, and if not, what are the pitfalls?

What about cleaning the pasted content as I described here?
This might also remove junk.
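
As for the replaceAll idea from the question, here is a minimal sketch of its pitfalls and one possible alternative. The class name AsciiCleaner and the choice of replacements are assumptions for illustration only. Note that \p{ASCII} matches the whole range 0x00-0x7F, so control characters such as NUL survive the original one-liner, while typographic quotes and accented letters are silently dropped.

```java
import java.util.regex.Pattern;

class AsciiCleaner {
    // Matches anything that is NOT printable ASCII (space..tilde), tab, CR or LF.
    private static final Pattern NOT_PRINTABLE_ASCII =
            Pattern.compile("[^\\x20-\\x7E\\t\\r\\n]");

    // Hypothetical helper: map common Word punctuation to ASCII first,
    // then drop whatever is still outside printable ASCII.
    static String clean(String input) {
        String s = input
                .replace('\u201C', '"').replace('\u201D', '"')   // curly double quotes
                .replace('\u2018', '\'').replace('\u2019', '\'') // curly single quotes
                .replace('\u2013', '-').replace('\u2014', '-');  // en and em dashes
        return NOT_PRINTABLE_ASCII.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Curly quotes, a NUL character and an accented letter, as a paste from Word might contain.
        String pasted = "\u201CHello\u201D\0 caf\u00E9";

        // The one-liner from the question: quotes and the accented letter vanish, NUL stays.
        String naive = pasted.replaceAll("[^\\p{ASCII}]", "");
        System.out.println(naive.indexOf('\0') >= 0); // prints true

        System.out.println(clean(pasted)); // prints "Hello" caf
    }
}
```

Either variant loses information (the accented letter is gone), so this only makes sense if the lower layers really cannot handle anything beyond ASCII.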

Related

When I use getRawSignature() to get the comment which has " in it, it is writing some improper string "â€œ"

When I use getRawSignature() to get a comment that has " in it, it writes the improper string "â€œ". How can I resolve this to get the correct output? Is there an alternative function available from IASTNode?

â€œ is mojibake. It is almost always the result of taking the Unicode character “ (U+201C, the left double quotation mark; that is not a normal quote, it is slanted), converting it to bytes using UTF-8, and then reading those bytes back into a string using a single-byte encoding such as Windows-1252 or some ISO-8859-X. That's how you get mojibake: take text, save it in one encoding, read it in another, and most non-ASCII characters turn into garbage. You generally can't unbake this stuff; you have to get your encodings right and read data in with the same encoding you wrote it with.
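
The mechanism can be reproduced in a few lines. This is only a sketch: the class and method names are made up, and it assumes the misreading charset was Windows-1252 (the € and œ in the garbled output are Windows-1252 characters), looked up with Charset.forName since it is not one of the StandardCharsets constants.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode with one charset, decode with another: the recipe for mojibake.
    static String misdecode(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String quote = "\u201C"; // left double quotation mark

        // UTF-8 bytes E2 80 9C, read as Windows-1252, become three characters.
        String baked = misdecode(quote, StandardCharsets.UTF_8, cp1252);
        System.out.println(baked.length()); // prints 3

        // Reversing the mistake works here only because no byte was lost on the way.
        String unbaked = misdecode(baked, cp1252, StandardCharsets.UTF_8);
        System.out.println(unbaked.equals(quote)); // prints true
    }
}
```

The reverse trip is not reliable in general. The closing quote U+201D, for example, contains the byte 0x9D, which has no assignment in Windows-1252, so that byte is typically lost at the first misdecoding and cannot be restored afterwards.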
However, that is probably not the root cause here.
You've likely corrupted your source file: you pasted some source code into Word, saved it from Word, and then tried to read it with your parser. This breaks your code, because Word converts things, such as turning "Hello" into “Hello”, which you definitely do not want. You'll have to go in and undo all the damage by hand, get a backup, or, if you're actually writing source code in MS Word, stop doing that right away: it is not a code editor and cannot be used to write code. Use Notepad++, Atom, Eclipse, IntelliJ, etc.
TL;DR:
Real fix: Stop using MS Word to edit source. It mangled the code beyond recognition.
If somehow you really wanted this (doubtful), find all the places where you convert strings to bytes or vice versa implicitly and stop using them: you always want the explicit variants, where you specify the charset. Then specify StandardCharsets.UTF_8. There are many such methods, and you have pasted no code here, so I can't tell you where you call one of them. An example is new String(byteArr), which silently uses the platform default charset; treat that constructor as forbidden and call new String(byteArr, StandardCharsets.UTF_8) instead. You've got something like this earlier in your code, and it created a ticking time bomb. It went off when you invoked .getRawSignature(), but that is only where you see the explosion; you need to fix it where you created it.
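
A short sketch of what "explicit everywhere" can look like. The class and method names are hypothetical, and the in-memory byte array stands in for whatever file or stream the real code reads.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ExplicitCharsets {
    // String -> bytes: always name the charset instead of calling getBytes().
    static byte[] toBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // bytes -> String: always name the charset instead of calling new String(bytes).
    static String fromBytes(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Readers built on streams need a charset too; InputStreamReader without one
    // falls back to the platform default.
    static String readFirstLine(byte[] data) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String comment = "\u201CHello\u201D"; // text with curly quotes
        byte[] stored = toBytes(comment);
        System.out.println(fromBytes(stored).equals(comment));     // prints true
        System.out.println(readFirstLine(stored).equals(comment)); // prints true
    }
}
```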

Display special characters using entity or hex values

I am trying to display ŵ in my JSF page but am unable to do so. The text with special characters is read from a properties file, but on my application screen it becomes something else. I tried using entity values without success. For example, if the original text is:
ŵyhsne klqdw dwql
then after replacing the character with its entity or hex value:
&wcirc ;yhsne klqdw dwql
the page displays it literally, as is.

I can only guess at your question. Please edit it and improve it.
If you are displaying it on the web, you should use &wcirc; (note: without spaces), but this also requires a font on the client side that supports the character.
If the string is in your code: replace the character with \u0175.
But probably the best way is to just use ŵ directly, whether in code, in a web page, or in any file, and to make sure that those files (or sources) are interpreted as UTF-8 and that the pages you deliver are UTF-8. If you are not using UTF-8, check in the same way that you are consistently using the correct encoding.
And sending a character doesn't mean it can be displayed. There is always the possibility that a font will not have all "special" characters in it.
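
Since the text comes from a properties file, the loading step is one place worth checking. java.util.Properties.load(InputStream) decodes the file as ISO-8859-1, while load(Reader) uses whatever charset the reader was given. The sketch below is not the asker's setup: the class name, the key "label" and the in-memory file content are all assumptions, and it does not cover how JSF resource bundles are loaded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class Utf8Properties {
    // Decodes the content as UTF-8 by going through a Reader.
    static Properties loadUtf8(byte[] fileContent) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(fileContent), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // load(InputStream) assumes ISO-8859-1, so UTF-8 bytes come out garbled.
    static Properties loadLatin1(byte[] fileContent) {
        Properties props = new Properties();
        try (ByteArrayInputStream in = new ByteArrayInputStream(fileContent)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // A properties file saved as UTF-8 with the raw character in it.
        byte[] rawFile = "label=\u0175yhsne klqdw dwql\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(loadUtf8(rawFile).getProperty("label").charAt(0) == '\u0175');   // prints true
        System.out.println(loadLatin1(rawFile).getProperty("label").charAt(0) == '\u0175'); // prints false

        // The same file using the backslash-u escape is pure ASCII and survives either way.
        byte[] escapedFile = "label=\\u0175yhsne klqdw dwql\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(loadLatin1(escapedFile).getProperty("label").charAt(0) == '\u0175'); // prints true
    }
}
```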

Properly storing copy/pasted text from a Microsoft Office document into a MySQL database

I know that Microsoft Office uses a different encoding. When someone copies and pastes text from Office into a Java text panel, it looks OK. But when you then store it in a MySQL database and retrieve it, it suddenly becomes all kinds of rubbish Latin characters.
I've tried converting it to UTF-8 before storing it, but that does not seem to work.
I wonder if there is any way to detect whether there are any such Latin characters in the text, so I can simply pop up an alert to let the user know before they save it.
Or is there any way to make the jTextField display everything as UTF-8 characters only, so that when the user copies and pastes from Word, it shows the garbage characters right away instead of looking fine at first?
Example: the user enters something in Word and pastes it into the jTextField; we pass the string directly to the database (note that our SQL database uses utf8_general_ci), then fetch it back into the JPanel, and we get:
ÃƒÆ’Ã†â€™Ãƒâ€
Ã¢â‚¬â„¢ÃƒÆ’Ã¢â‚¬Å¡Ãƒâ€šÃ‚Â¢ÃƒÆ’Ã†â€™Ãƒâ€šÃ‚Â¢ÃƒÆ’Ã‚Â¢ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬Ãƒâ€¦Ã‚Â¡ÃƒÆ’Ã¢â‚¬Å¡Ãƒâ€šÃ‚Â¬ÃƒÆ’Ã†â€™Ãƒâ€šÃ‚Â¢ÃƒÆ’Ã‚Â¢ÃƒÂ¢Ã¢â€š

I've had similar issues. The first thing to do is find out what exactly has been written to the database. This is very easy with MySQL; just log in and run
SELECT HEX( column ) FROM table;
That'll give you the bytes that have been written to the table. You can then use an app I wrote for this very purpose. Take the hex string you got back from MySQL and give it to the main class using the -b flag for bytes. You'll get a whole heap of output, and hopefully one of the lines will be what you had originally.
Once you know what it's being stored as, you have a starting point for debugging.
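
The same idea can be tried by hand. The following is not the app mentioned above, just a rough stand-in: it takes a hex string as returned by HEX() and decodes the bytes under a few candidate charsets. The class name, the candidate list and the sample value E28099 are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class HexInspector {
    // Turns "E28099" into the bytes E2 80 99.
    static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decodes the same bytes under several charsets, keyed by charset name.
    static Map<String, String> decodeAll(String hex) {
        byte[] bytes = hexToBytes(hex);
        Charset[] candidates = {
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1,
            Charset.forName("windows-1252")
        };
        Map<String, String> result = new LinkedHashMap<>();
        for (Charset cs : candidates) {
            result.put(cs.name(), new String(bytes, cs));
        }
        return result;
    }

    public static void main(String[] args) {
        // E2 80 99 is the UTF-8 encoding of the right single quotation mark (U+2019).
        for (Map.Entry<String, String> e : decodeAll("E28099").entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

If the UTF-8 decoding of the stored bytes already shows garbage, the text was damaged before it reached the table; if it shows the original text, the damage happens on the way back out.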

encountering a square or null or ? mark in retrieving utf-8 characters from mysql database in java

There is no problem when I insert the characters "ñÑ" into the MySQL database. However, when I retrieve the same data, the characters returned by the query appear as a null value, a ?, or a square.
Please help me with this; I have been struggling with these problems for many weeks and just cannot understand them anymore. I have written the code in Java.

The "�" is the replacement character, used when something processing characters can't display or otherwise handle a character. A box is sometimes used for the same purpose, or indicates that the font being used doesn't have a glyph for some character.
To resolve this, check that the character sets being used for the various components, such as the column and connection, are correct.
See also: "Setting the default Java character encoding?"
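
To see how both symptoms can arise on the Java side, here is a small sketch that needs no database. The class and method names are made up. It pushes "ñÑ" through a charset that cannot represent it, and reads Latin-1 bytes as if they were UTF-8.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Encoding to a charset without the character substitutes '?' for it.
    static String viaAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Bytes written as ISO-8859-1 but decoded as UTF-8 are malformed input,
    // which the decoder replaces with U+FFFD.
    static String latin1ReadAsUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u00F1\u00D1"; // the two characters from the question

        System.out.println(viaAscii(text)); // prints ??
        System.out.println(latin1ReadAsUtf8(text).indexOf('\uFFFD') >= 0); // prints true
    }
}
```

A literal ? therefore usually points to an encoding step that could not represent the character, while � or a box points to a decoding or font problem.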

Linebreaks and Spaces appearing in TextAreas

In our Cocoon environment we have a few forms with textareas. Once the user submits a form, an overview is displayed before the final submit is done.
For this purpose, each form object's data is stored in POJOs.
If the user is on that overview page and decides to go back to the form, the form is filled with the already submitted data read from the POJOs. However, when filling the textarea with data from the JavaObject, some linebreaks and whitespaces are added to the data.
I checked the POJO's data for these linebreaks but the String looks clean. Each whitespace entered by the user is of Character 32, which is a simple space.
I also checked the Serializer (we use a custom one that extends Cocoon's AbstractSerializer), but no linebreaks or whitespaces are added by accident there.
When using Javascript to output the current content of that Textarea though, it contains linebreak characters ('\n') as well as the aforementioned additional whitespaces.
My suspicion is that the conversion from Java's Space-Character to HTML's space characters somehow fails.
These linebreaks appear instead of spaces, not inside a single word. They also change position depending on the textarea's size. They are not at the end of a line, so they can't be forced by wrap or something.
Example:
User input "test test test test test" becomes "test\n [36x Space] test test test test"

Here's a thought... What do you use to actually output the page to the client? I'm not entirely familiar with the Cocoon environment but I assume you're using some sort of a "templating" engine (JSP? Velocity?). I'm talking about the actual file, on the server side, that has the textarea element; paste here the snippet of code that involves the textarea element and we'll see.

These extra linebreaks and whitespaces are typical of XSL transformations that were written without regard for such whitespace issues.
It is likely that you use XSLT in your Cocoon application, so the stylesheets should probably be checked for this.
There are a number of well-known precautions you can take. You can start on SO (XSLT - remove whitespace from template) to get an idea of them.
