Display special characters using entity or hex values - java

I am trying to display ŵ on my JSF page but am unable to do so. The text with special characters is read from a properties file, but on my application screen it turns into something else. I tried using entity values, without success. For example, if the original text is:
ŵyhsne klqdw dwql
then after replacing the character with its entity or hex value it becomes:
**&wcirc ;**yhsne klqdw dwql (the space before the semicolon is only there to keep the entity visible), but my page displays the entity text as-is instead of the character.

I can only guess at your question. Please edit it and improve it.
If you are displaying on the web, you should use the entity &wcirc ; (written without the space), but this also requires a font on the client side that supports such a character.
If the string is in your code: replace the character with \u0175.
But probably the best way is to use just ŵ directly, whether in code, on the web, or in any file; you should then make sure that such files (or sources) are interpreted as UTF-8 and that the pages you deliver are UTF-8. If you are not using UTF-8, check in the same way that you are using your chosen encoding consistently.
And sending a character doesn't mean it can be displayed. There is always the possibility that a font will not have all "special" characters in it.
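A frequent culprit in exactly this scenario is the properties file itself: java.util.Properties.load(InputStream) decodes as ISO-8859-1, so UTF-8 text such as ŵ is mangled before it ever reaches the page. A minimal sketch of loading the file through a Reader with an explicit charset (the file name and key are made up for illustration):

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.Properties;

    public class Utf8Properties {
        public static void main(String[] args) throws IOException {
            Properties props = new Properties();
            // Properties.load(InputStream) assumes ISO-8859-1;
            // pass a Reader to control the charset explicitly.
            try (Reader reader = new InputStreamReader(
                    new FileInputStream("messages.properties"),
                    StandardCharsets.UTF_8)) {
                props.load(reader);
            }
            System.out.println(props.getProperty("greeting")); // e.g. "ŵyhsne klqdw dwql"
        }
    }

If the properties file must stay ISO-8859-1, the classic alternative is to escape the character as \u0175 in the file, as suggested above.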

Related

how to compare a value's encoding of string type with a specific encoding in java?

I'm told to write code that takes a string and checks whether its encoding equals a specific encoding that we want. I've searched a lot but didn't find anything. I found a method (getEncoding()) but it only works with files, which is not what I want. I'm also told that I should use the standard Java library, not Mozilla or Apache methods.
I really appreciate any help. Thanks in advance.
What you are thinking of is "internationalization". There are libraries for this, like Loc4j, but you can also get at it using java.util.Locale in Java. However, in general, text is just text: a token with a certain value. No localization information is stored in a character. This is why a file normally provides its encoding in a header, and why a console or terminal can provide localization via certain commands/functions.
Unless you know the source encoding and the tokens used, you have only a limited ability to guess what encoding is used at the other end. If you still want to do this, you will need to go into deeper areas such as the statistical analysis used in cryptanalysis. That in turn requires databases on the usage of different tokens, and depending on the quality of the text, the databases, and the algorithms, a certain amount of text is needed. Special cases, like writing Swedish in, e.g., a US encoding (using a for å and ä, or o for ö), will require more advanced analysis.
EDIT
Since I got a comment that encoding and internationalization are different things, I will add some remarks. It is possible to work with different encodings while staying purely within English (e.g., some English special characters). It is also possible to work with encodings directly, for example via Charset. However, for many applications that deal with several encodings it can still be effective to use Locale, since that library can perform a lot of operations on text in different encodings.
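To make the core point concrete: a Java String has no encoding of its own (internally it is UTF-16), so the only meaningful check is whether a given byte sequence is valid in a given charset. A small sketch using only the standard java.nio.charset classes (the sample bytes are just for illustration):

    import java.nio.ByteBuffer;
    import java.nio.charset.*;

    public class EncodingCheck {
        // Returns true if the raw bytes form valid text in the given charset.
        static boolean isValidIn(byte[] raw, Charset charset) {
            CharsetDecoder decoder = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(raw));
                return true;
            } catch (CharacterCodingException e) {
                return false;
            }
        }

        public static void main(String[] args) {
            byte[] bytes = {(byte) 0xC3, (byte) 0xA5}; // "å" encoded as UTF-8
            System.out.println(isValidIn(bytes, StandardCharsets.UTF_8));    // true
            System.out.println(isValidIn(bytes, StandardCharsets.US_ASCII)); // false
        }
    }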
Thanks for your answers and contributions, but these two links did the trick. I had already seen these two pages, but they didn't seem to work for me because I was trying to get the encoding directly and then compare it with the specific one.
This is one of them
This is another one.

Java: Print Text With Strikethrough

I'm printing to a file. Is there a way to print the text with a strikethrough? I have done some googling but did not find any applicable answers.
You would have to save the file as PDF, HTML, or some kind of word-processor document. Simple text (or more correctly, plain text) does not have formatting ... in any language ...
I'd recommend HTML. It is simple to create (PDF is a pain), gives you the option of other formatting (people always end up asking for a heading), allows you to format as tables (managers love tables), and will open anywhere (it could even be served from a web server, eliminating printing and tree-killing altogether).
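As a minimal sketch of that HTML route (file name and contents are invented for illustration), HTML's del element renders with a strikethrough in any browser:

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class HtmlStrikeDemo {
        public static void main(String[] args) throws IOException {
            // Write a tiny HTML file; <del> (or <s>) is shown struck through.
            try (Writer out = new OutputStreamWriter(
                    new FileOutputStream("report.html"), StandardCharsets.UTF_8)) {
                out.write("<!DOCTYPE html>\n<html><body>\n");
                out.write("<p>Price: <del>$100</del> $80</p>\n");
                out.write("</body></html>\n");
            }
        }
    }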
If you want to force it, you can use the Unicode escape for those characters, like this:
"\u03C0" // π
http://unicode-table.com/de/0268/
This one, as an example, is the ɨ (Latin small letter i with stroke).
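A related plain-text trick, sketched below, is the combining long stroke overlay U+0336 appended after each character. This is only a best-effort hack: whether it actually looks struck through depends entirely on the viewer's font.

    public class Strikethrough {
        // Appends the combining long stroke overlay (U+0336) after each char;
        // many, but not all, fonts render the result as strikethrough text.
        static String strike(String s) {
            StringBuilder sb = new StringBuilder(s.length() * 2);
            for (char c : s.toCharArray()) {
                sb.append(c).append('\u0336');
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(strike("hello")); // h̶e̶l̶l̶o̶
        }
    }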

Reading PDF in java as a file and making "PDF" editable

I have a program that will be used for building a questions database. I'm making it for a site that wants users to know that the content was downloaded from that site. That's why I want the output to be PDF: almost everyone can view it, and almost nobody can edit it (and remove, e.g., a footer or watermark, unlike with some simpler file types). That explains why it HAS to be PDF.
This program will be used by numerous users who will create new databases or expand existing ones. That's why output spread across multiple files would be an extremely sloppy and inefficient way of achieving what I want (it would complicate things for the user).
What I want to do is create PDF files that are still editable with my program once created.
I want to achieve this by embedding a custom file format, readable by my program, into the output PDF.
I came up with three ways of doing that:
1. Attach the file to the PDF, then corrupt the part of the PDF that contains it in a way that merely makes the PDF unaware that it contains the file, so a user cannot notice it (easily). Upon reading the document, I'd revert the corruption and extract the file using one of the many PDF libraries.
2. Hide the file inside an image added to the PDF somewhere on the first or last page, somehow (I still need to work that out) hidden from the public eye. Knowing its location, it should be relatively easy to retrieve it using a PDF library.
3. I have learned that if you add a "%" sign as the first character of a line inside a PDF, the whole line is ignored by the PDF reader (similar to "//" in Java), at least in Adobe Reader. That makes it possible for me to add as many lines as I want to the PDF (if I know where, and I do) without the end user being aware of it. I could embed my whole custom file into the PDF that way. The problem here is that I actually have to read the PDF using one of Java's input readers, and I'm not sure which one. I understand that a PDF can't be read like a text file since it's a binary file (right?).
In the end, I decided to go with method number 3, unless someone has a better idea. The conditions are:
1. One file only. And that file is a PDF.
2. The user must not be aware of the addition.
The problem is that I don't know how to read the PDF as a plain file (I'm not trying to read it as a PDF, which I would do using a PDF library).
So, does anyone have a better idea? If not, how do I read a PDF as a FILE, so the output is an array of characters (with newline detection), and then rewrite the whole file with my content added?
In Java, there is no real difference between text and binary files; you can read both as an InputStream. The difference is that for a binary file you can't really create a Reader, because that assumes there's a way to convert the byte stream to Unicode characters, and that won't work for PDF files.
So in your case, you'd need to read the file into byte buffers and loop over them to scan for the bytes representing '%' and PDF's end-of-line characters.
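A rough sketch of that byte-level scan (the file name is a placeholder; this only locates '%' comment lines, it does not parse PDF structure):

    import java.io.IOException;
    import java.nio.file.*;

    public class PdfByteScan {
        public static void main(String[] args) throws IOException {
            // Read the whole PDF as raw bytes, never through a Reader.
            byte[] pdf = Files.readAllBytes(Paths.get("input.pdf"));
            for (int i = 0; i < pdf.length; i++) {
                if (pdf[i] == '%') {
                    // Scan to the end of the line; PDF allows CR, LF, or CRLF.
                    int j = i;
                    while (j < pdf.length && pdf[j] != '\r' && pdf[j] != '\n') {
                        j++;
                    }
                    System.out.println("comment at offset " + i
                            + ", length " + (j - i));
                    i = j;
                }
            }
        }
    }

Note that '%' bytes can also occur inside binary streams, so a real implementation would have to be more careful than this.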
A better way is to use an existing mechanism for encoding data in a PDF: XMP tags. This allows any sort of complex key-value pairs to be encoded in XML and embedded in PDFs, JPEGs, etc. See http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf.
There's an open-source Java library that lets you manipulate that: http://pdfbox.apache.org/userguide/metadata.html. See also a related question from someone who succeeded at it: custom schema to XMP metadata, or http://plindenbaum.blogspot.co.uk/2010/07/pdfbox-insertextract-metadata-frominto.html
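A rough sketch of embedding a custom payload as XMP with PDFBox (written against the PDFBox 2.x API; the namespace, property name, and file names are invented for illustration):

    import java.io.File;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.common.PDMetadata;

    public class XmpEmbed {
        public static void main(String[] args) throws IOException {
            try (PDDocument doc = PDDocument.load(new File("questions.pdf"))) {
                // Minimal XMP packet carrying one custom property.
                String xmp = "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
                        + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
                        + "<rdf:Description rdf:about=\"\""
                        + " xmlns:app=\"http://example.com/myapp/\">"
                        + "<app:payload>custom data goes here</app:payload>"
                        + "</rdf:Description></rdf:RDF></x:xmpmeta>";
                PDMetadata metadata = new PDMetadata(doc);
                metadata.importXMPMetadata(xmp.getBytes(StandardCharsets.UTF_8));
                doc.getDocumentCatalog().setMetadata(metadata);
                doc.save("questions-with-data.pdf");
            }
        }
    }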
It's all just 1s and 0s - just use RandomAccessFile and start reading. The PDF specification defines what the valid newline characters are (there are several). Grab a hex editor, open a PDF, and you can at least start getting a feel for things. Be careful where you insert your lines, though: you'll need to add them towards the end of the file, where they won't screw up the xref table offsets to the obj entries.
Here's a related question that may be of interest: PDF parsing file trailer
I would suggest putting your comment immediately before the startxref line. If you put it anywhere else, you could wind up shifting things around and breaking the xref table pointers.
So a simple algorithm for inserting your special comment will be:
1. Go to the end of the file.
2. Search backwards for startxref.
3. Insert your special comment immediately before startxref; be sure to put a newline character at the end of your special comment.
4. Save the PDF.
You can (and should) do this manually in a hex editor.
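For completeness, a rough sketch of that algorithm in Java (file names are placeholders; this handles only the simple case of a single startxref at the end, not incrementally-updated PDFs):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;

    public class InsertPdfComment {
        public static void main(String[] args) throws IOException {
            byte[] pdf = Files.readAllBytes(Paths.get("in.pdf"));
            byte[] key = "startxref".getBytes(StandardCharsets.US_ASCII);

            // Steps 1 + 2: search backwards from the end for "startxref".
            int pos = -1;
            outer:
            for (int i = pdf.length - key.length; i >= 0; i--) {
                for (int j = 0; j < key.length; j++) {
                    if (pdf[i + j] != key[j]) continue outer;
                }
                pos = i;
                break;
            }
            if (pos < 0) throw new IOException("startxref not found");

            // Steps 3 + 4: splice in the comment line and save.
            byte[] comment = "% hidden-payload\n".getBytes(StandardCharsets.US_ASCII);
            try (OutputStream out = Files.newOutputStream(Paths.get("out.pdf"))) {
                out.write(pdf, 0, pos);
                out.write(comment);
                out.write(pdf, pos, pdf.length - pos);
            }
        }
    }

Inserting at this spot works because the startxref value points backwards to the xref table, so bytes added after the table but before the keyword don't shift any recorded offsets.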
Really important: are your users going to be saving changes to these files? I.e., if they fill in a form field, are they going to hit save? If they are, your comment lines may be removed during the save (and different versions of different PDF viewers could behave differently in this regard).
XMP tags are the correct way to do what you are trying to do - you can embed entire XML segments, and I think you'd be hard pressed to come up with a data structure that couldn't be expressed as XML.
I personally recommend using iText for this, but I'm biased (I'm one of the devs). The iText In Action book has an excellent chapter on embedding XMP data into PDFs. Here's some sample code from the book (which I definitely recommend): http://itextpdf.com/examples/iia.php?id=217

Linebreaks and Spaces appearing in TextAreas

In our Cocoon environment we have a few forms with textareas. Once the user submits a form, an overview is displayed before the final submit happens.
To that end, each form object's data is stored in POJOs.
If the user is on that overview page and decides to go back to the form, the form is filled with the already-submitted data read from the POJOs. However, when the textarea is filled with data from the Java object, some linebreaks and whitespace get added to the data.
I checked the POJO's data for these linebreaks but the String looks clean. Each whitespace entered by the user is of Character 32, which is a simple space.
I also checked the Serializer (we use a custom one that extends Cocoon's AbstractSerializer) but no linebreaks/whitespaces added by accident here.
When using JavaScript to output the current content of that textarea, though, it contains linebreak characters ('\n') as well as the aforementioned additional whitespace.
My suspicion is that the conversion from Java's space character to HTML's space characters somehow fails.
These linebreaks appear in place of spaces, not inside a single word. They also change position depending on the textarea's size. They are not at the end of a line, so they can't be caused by wrapping or anything similar.
Example:
User input "test test test test test" becomes "test\n [36x Space] test test test test"
Here's a thought... What do you use to actually output the page to the client? I'm not entirely familiar with the Cocoon environment, but I assume you're using some sort of "templating" engine (JSP? Velocity?). I'm talking about the actual file, on the server side, that contains the textarea element; paste the snippet of code that involves the textarea element here and we'll see.
These extra linebreaks and whitespaces are typical of XSL transformations that were developed unaware of such linebreak/whitespace issues.
It is likely that you use XSLT in your Cocoon application, and maybe your stylesheets should be checked in that regard.
There are a number of well-known precautions you can take. You can start on SO (XSLT - remove whitespace from template) to get an idea of them.
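For illustration, the usual first-line defenses in a stylesheet look like this (a sketch only; where they belong depends on your pipeline):

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Drop whitespace-only text nodes from the source tree -->
      <xsl:strip-space elements="*"/>
      <!-- Don't let the serializer inject indentation of its own -->
      <xsl:output method="html" indent="no"/>
      <!-- ... your templates ... -->
    </xsl:stylesheet>

Inside a textarea every character between the tags is significant, so any indentation the transformation adds becomes part of the field's value.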

a question related to URL

Dear all, I have a question in my Java program. I think it should be classified as a URL problem, but I'm not 100% sure. If you think I am wrong, feel free to recategorize it, thanks.
I will state my problem as simply as possible.
I did a search on the famous Chinese search engine baidu.com for the Chinese keyword "奥巴马" (Obama in English), and the way I do that is to pass a URL (in a Java program) to the browser, like:
http://news.baidu.com/ns?word=奥巴马
and it works perfectly, just as if I had typed the "奥巴马" keyword into the text field on baidu.com.
However, now my advisor wants something else. Since he cannot read Chinese webpages but wants to make sure the webpages I got from baidu.com are related to "Obama", he asked me to translate them back, i.e., to use Google Translate to turn the Chinese webpage into an English one.
This sounds straightforward. However, I ran into my problem here.
If I simply pass the URL "http://news.baidu.com/ns?word=奥巴马" into Google Translate and tick the "Chinese to English" option, the result looks awful. (I don't know the cause; maybe it's related to Chinese character encoding.)
Alternatively, if my browser has the "http://news.baidu.com/ns?word=奥巴马" webpage open and I click the "百度一下" button (which simply means "search"), you will notice the URL changes; if I pass this new URL into Google Translate and do the same thing, the result is much better.
I hope I am not making this problem sound too complicated, and I apologize for the Chinese words involved, but I really need your help here. Because I do all this in a Java program, I couldn't figure out how to reproduce that "百度一下" (pressing the search button) step and obtain the new URL. If I could get that new URL, things would be easy: I could just call Google Translate in my Java code and pop up a new window to show my advisor.
Please share any of your ideas or thoughts here. Thanks a lot.
Robert
You could use
URLEncoder.encode("http://news.baidu.com/ns?word=奥巴马", "utf-8")
then pass the resulting URL to Google Translate like:
http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=YOUR_URL
Cheers
When you press the search button, the browser encodes the search term into %E5%A5%A5%E5%B7%B4%E9%A9%AC, which is the UTF-8 encoding for 奥巴马. It does this because UTF-8 is the default encoding for HTML forms.
Java uses a UTF-16 encoding internally, so it’s possible that the URL library builds a request in that encoding if you do not specify anything.
However, I could not reproduce your problem with Google translate — pasting that URL appeared to work correctly no matter how I did it.
Try calling
URLEncoder.encode("http://news.baidu.com/ns?word=奥巴马", "utf-8")
(or utf-16; I'm not quite familiar with the representation of Chinese characters)
URLs can contain only ASCII characters. All other characters must be converted to bytes and then %-encoded as ASCII. However, there is no mandate on which charset is used to convert chars to bytes. UTF-8 is recommended, but not required. As long as a server expresses its preference for a charset, the client should respect that and use the same charset for encoding.
You can see from the page info that baidu uses the gb2312 encoding. The characters 奥巴马 in a form on its page will be converted to bytes in gb2312, B0C2 B0CD C2ED, then %-encoded to %B0%C2%B0%CD%C2%ED. That is what is actually sent to the baidu server: http://www.baidu.com/s?wd=%B0%C2%B0%CD%C2%ED
Your OS happens to be configured to use gb2312 by default, therefore when you paste http://news.baidu.com/ns?word=奥巴马 into the browser, the browser does the same thing, and baidu gets the correct characters. When I paste that URL into my browser, it screws up, because my OS uses UTF-8, so the browser encodes these Chinese characters in UTF-8, which is not what baidu expects. (When entering a URL directly into a browser, the browser may not have communicated with the server yet and does not know which charset the server prefers, so it uses the platform default charset.)
Now, Google uses UTF-8. That's why if you paste the URL into the Google form, it screws up just like on my OS: the characters are encoded in UTF-8, and baidu tries to parse them as gb2312 and gets totally wrong words.
Solution is easy. Just encode the parameter in the way that the server expects:
"http://news.baidu.com/ns?word=" + URLEncoder.encode("奥巴马", "gb2312")
