BUG YUICompressor with special chars - java

I am using a newer version of YUICompressor (2.4.7) to compress my JavaScript and CSS files. For a long time everything was apparently fine... until I realized that the special characters "í" and "Í" are not being converted successfully. Strangely, other special characters are converted as expected. Why are only "í" and "Í" affected? Because only these two characters are broken, I ruled out a charset conflict between the file system and the language. It looks like a bug. Could anyone help me with this problem?
See what happens when I convert files:
Converting CSS
From:
@import url("/láÍíàyout.css");
To:
@import url("/lá�?íàyout.css");
Converting JS
From:
var x = 'cícÍsúlúm irmãêîôûúàá';
To:
var x="c�c�?súlúm irmãêîôûúàá";

Hmm... when the problem only involves the letter i, the Turkey test comes to mind.
The upper-case i in Turkish is not I; it is İ, an I with a dot on it. When string manipulations such as toUpperCase() are used, you must pay attention or your program won't run correctly on Turkish operating systems.
Example:
"fail".toUpperCase().equals("FAIL")
This code tries to do a case-insensitive string comparison, but it fails on Turkish systems.
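As a sketch of the usual fix, pass an explicit locale to the case conversion (or use equalsIgnoreCase, which is locale-independent) so the comparison does not depend on the operating system's default locale:
import java.util.Locale;

public class TurkeyTest {
    public static void main(String[] args) {
        // Under a Turkish locale, "fail".toUpperCase() yields "FAİL" (dotted capital I).
        System.out.println("fail".toUpperCase(new Locale("tr", "TR")));

        // Locale-independent alternatives that behave the same everywhere:
        System.out.println("fail".toUpperCase(Locale.ROOT).equals("FAIL")); // true
        System.out.println("fail".equalsIgnoreCase("FAIL"));                // true
    }
}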
If you're using a Turkish system, try running it on a non-Turkish system and tell us whether the bug in YUICompressor still exists.

Is your character set UTF-8? If not, do you specify it (either on the command line, or as an argument to InputStreamReader/OutputStreamWriter)?
If using it as a servlet, do you set the encoding on both the request and the response?
I integrated the YUI Compressor with my application today (version 2.4.7) and it is processing Unicode characters correctly, so you may be missing one of the above steps.
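For reference, a minimal sketch of driving the compressor from Java with an explicit charset on both the reader and the writer (the file names are just examples; CssCompressor is the class shipped in the yuicompressor jar):
import java.io.*;
import java.nio.charset.StandardCharsets;
import com.yahoo.platform.yui.compressor.CssCompressor;

public class CompressCss {
    public static void main(String[] args) throws IOException {
        // Explicit charsets on both sides, instead of the platform default.
        try (Reader in = new InputStreamReader(new FileInputStream("layout.css"), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(new FileOutputStream("layout.min.css"), StandardCharsets.UTF_8)) {
            new CssCompressor(in).compress(out, -1); // -1 = no forced line breaks
        }
    }
}
From the command line, the jar accepts a --charset option as well, e.g. java -jar yuicompressor-2.4.7.jar --charset UTF-8 layout.css -o layout.min.css.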

Related

ISO-8859-1 character encoding not working in Linux

I have tried the code below on Windows and was able to decode the message, but when I tried the same code on Linux it does not work.
String message ="ööööö";
String encodedMsg = new String(message.getBytes("ISO-8859-1"), "UTF-8");
System.out.println(encodedMsg);
I have verified that the default character set on the Linux platform is UTF-8 (Charset.defaultCharset().name()).
Please suggest how to do the same encoding on the Linux platform.
The explanation for this is, almost always, that somewhere bytes are turned into characters or characters are turned into bytes without the encoding being clearly specified, thus defaulting to the 'platform default', thus causing different results depending on which platform you run it on.
Except, every place where you turn bytes into chars or chars into bytes in your snippet of code explicitly specifies the encoding.
Or does it?
String message ="ööööö";
Ah, no, you forgot one place: javac itself.
You compile this code. That is where raw bytes are turned into characters: the compiler is looking at ManmohansSourceFile.java, which is a file, and a file isn't characters but a bunch of bytes. The Java compiler works on characters, so those bytes are converted using some encoding. If you don't use the -encoding switch when running javac (or Maven or Gradle runs javac and passes an encoding; which one depends on your pom/Gradle file), the source is read using the system encoding, and whether the string actually contains the bytes you intended - who knows.
This is most likely the source of your problem.
The fix? Pick one:
Don't put non-ASCII in your source files. Note that you can write the Unicode symbol "Latin Capital Letter A with Tilde" as \u00C3 in your source file instead of as Ã, and use \u00B6 for ¶.
String message ="\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6";
String encodedMsg = new String(message.getBytes("ISO-8859-1"), "UTF-8");
System.out.println(encodedMsg);
> ööööö
Ensure you specify the right -encoding switch when compiling. So, if the text editor you use to type String message = "¶"; is configured for UTF-8, run javac -encoding UTF-8 manMohansFile.java.
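If you want to check what the compiled literal actually contains, a small sketch like this dumps each char as a Unicode code point (U+00F6 is ö; seeing the pair U+00C3 U+00B6 instead would mean the source was read with the wrong encoding):
public class CheckLiteral {
    public static void main(String[] args) {
        String message = "ööööö";
        // Print each char of the literal as U+XXXX to see how javac decoded the source file.
        message.chars().forEach(c -> System.out.printf("U+%04X%n", c));
    }
}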
First of all, I'm not sure exactly what you are expecting...your use of the term "encode" is a bit confusing, but from your comments, it appears that with the input "ööööö", you expect the output "ööööö".
On both Linux and OS X with Java 1.8, I do get that result.
I do not have a Windows machine to try this on.
As @Pshemo indicated, it is possible that your input, since it's hardcoded in the source code as a string, is being represented as UTF-8, not as ISO-8859-1. Actually, this is what I expected, and I was surprised that the code worked as you expected.
Try creating the input explicitly, encoding the string to ISO-8859-1 with String.getBytes("ISO-8859-1").

Store Arabic in String and insert it into database using Java

I am trying to pass an Arabic String into a function that stores it in a database, but the String's characters are converted into '?'.
For example:
String str = new String();
str = "عشب";
System.out.print(str);
the output will be:
"???"
and it is stored like this in the database.
If I insert it into the database directly, it works fine.
Make sure your character encoding is UTF-8.
The snippet you showed works perfectly as expected.
For example, if you are encoding your source files using windows-1252, it won't work.
The problem is that System.out is a PrintStream which converts the Arabic string into bytes using the default encoding, and that encoding presumably cannot handle the Arabic characters. Try
System.out.write(str.getBytes("UTF-8"));
System.out.println();
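Alternatively (a sketch; whether the characters display correctly still depends on the console's font and charset), wrap System.out once in a PrintStream with an explicit encoding and print normally afterwards:
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ArabicOut {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // All bytes written through this stream are encoded as UTF-8.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("عشب");
    }
}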
Many modern operating systems use UTF-8 as the default encoding, which supports non-Latin characters correctly. Windows is not one of those; ANSI is the default in Western installations (I have not used Windows recently, so that may have changed). Either way, you should probably force the default character encoding for the Java process, irrespective of the platform.
As described in another Stack Overflow question (see Setting the default Java character encoding?), you'll need to change the default as follows, for the Java process:
java -Dfile.encoding=UTF-8
Additionally, since you are running in an IDE, you may need to tell it to display the output in the indicated charset or risk corruption, though that is IDE specific and the exact instructions will depend on your IDE.
One other thing: if you are reading or writing text files, always specify the expected character encoding; otherwise you risk falling back to the platform default.
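A minimal sketch of what that looks like with the java.nio file API (the file name is just an example):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadUtf8 {
    public static void main(String[] args) throws IOException {
        // The charset is stated explicitly instead of relying on the platform default.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8)) {
            reader.lines().forEach(System.out::println);
        }
    }
}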
You need to set the character set to UTF-8 for this.
At the Java level you can do:
Charset.forName("UTF-8").encode(myString);
If you want to do so at the IDE level, you can do:
Window > Preferences > General > Content Types, set UTF-8 as the default encoding for all content types.

Intellij IDEA: Impossible to commit files: 'utf8' codec can't decode byte 0xcc in position 9

IntelliJ IDEA 14.0.1
Plugin: jetbrains-bitbucket-connector
I'm trying to commit files, but get the error:
Error: transaction abort!
rollback completed
abort: decoding near 'C:\Users\����\AppDa': 'utf8' codec can't decode byte 0xcc in position 9: invalid continuation byte!
Has anyone encountered this error? How can it be solved?
Thanks.
This probably isn't the answer you're looking for, but it gives you some insight into what might be going on:
On most systems, file paths are made up from bytes since file systems were designed decades before Unicode. Unicode is retrofitted to them by interpreting the bytes as UTF-8 encoded strings. Unfortunately, there is no way to say "this is Cp-1251" and "this is UTF-8" inside of a file name. Therefore, the "convert file name to string" code relies on the platform's default encoding. NTFS solved the problem by always storing file names as Unicode (ignoring the local code page) but the names are translated into the local code page when you use a tool which displays them on screen.
And then comes Python 2 where Unicode was also retrofitted in a similar way. Python just has the advantage that you have two types of objects (str and unicode) so in theory, you can tell raw bytes and Unicode apart. The problems start when you get a bunch of bytes from somewhere and the logic says "this should be Unicode" - which happens when you read file names from disk.
In your case, the file system passes bytes containing Cp1251-encoded characters to Python, but the Python code tries to read them as UTF-8 encoded Unicode. For many characters (code point < 128) this works, but it breaks for everything above. \xCC is a common case here: in Cp1251 it encodes the frequent Cyrillic letter М, while in UTF-8 it is a lead byte that must be followed by a valid continuation byte. This is why you see this error so often in Europe - we use those characters a lot.
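The same failure mode can be reproduced in Java with a strict decoder (a sketch; the two bytes are just illustrative Cp1251 data):
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] cp1251Bytes = {(byte) 0xCC, (byte) 0xE0}; // "Ма" in Cp1251
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(cp1251Bytes));
        } catch (CharacterCodingException e) {
            // MalformedInputException: 0xCC promises a continuation byte that never comes.
            System.out.println("decode failed: " + e);
        }
    }
}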
Now the people who created Mercurial are well aware of all this. Most of the time, Mercurial should just work. See https://www.mercurial-scm.org/pipermail/mercurial/2009-January/023762.html
As I see it, your problem could be caused by:
Somehow, Windows used the local code page to create your home directory (unlikely)
Mercurial gets the path as Unicode, but for some reason it thinks the string is raw bytes and tries to decode it with a UTF-8 decoder. Since the decoding is applied twice, this fails. Maybe you have an old version of Mercurial; try updating.
Maybe you showed us the wrong part of the error message and the problem is actually in a file which you tried to commit. In that case, we can ignore the odd � characters in the error message. Make sure you use the correct encoding when you edit the file.
To see which one it is, I suggest creating a folder C:\dev and working there. If this works, then there is something wrong with your home folder, or Mercurial has a bug.
The error is saying that the file location path contains bytes that are not valid UTF-8, so the decoder cannot decode the given file path and aborts the operation.
Look at the characters in the location path and correct any unknown characters present in it:
'C:\Users\����\AppDa'
Here, the ���� shows the characters that could not be decoded as UTF-8.
Edit:
Check your string with this tool to see which encoding your character set is encoded in: link to tool
Then you can use that encoder, but this is not a practical solution; the UTF-16 character set covers a much larger repertoire, though behaviour varies by platform and language.
I had the same problem. If you use Mercurial too, here is the solution:
go to [project directory]/.hg
open the "hgrc" file
below [ui], insert username = my_name_only_utf_characters <mail@example.com> (see the example below)
save & commit
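For reference, the relevant section of the hgrc file would then look something like this (the name and mail address are placeholders):
[ui]
username = my_name_only_utf_characters <mail@example.com>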

Turkish character while writing to database (postgresql)

I am working with Java and PostgreSQL on Windows. I have some words which include Turkish characters like İ, ş, ö, ç etc.
In Java I assign the words to a string and try to write it to the database. When I print it in Java, the encoding appears correct and all characters display correctly. However, when writing it to the database, the text gets mangled/scrambled.
I created my database with this command:
CREATE DATABASE dbname ENCODING "UTF-8"
I tried to fix it by writing the Turkish characters as Unicode escapes (İ -> \u0130, ş -> \u015F) and converting through ISO-8859-1:
//\u0130leti\u015Fim = İletişim
title = \u0130leti\u015Fim
String mytitle = new String(title.getBytes("ISO-8859-1"), "UTF-8");
Then I tried to write mytitle to the database, but it did not work.
Thanks for your advice.
SOLVED: I realized that it could write Turkish characters to the database; the problem was in the response. I added these lines before writing to the response:
String contentType= "text/html;charset=UTF-8";
response.setContentType(contentType);
response.setCharacterEncoding("utf-8");
After adding this, it works now. I hope I explained it clearly.
When you call title.getBytes("ISO-8859-1"), you're promising the Java runtime that the characters in the string can be represented as ISO-8859-1 bytes, which is not actually true for either \u0130 or \u015f.
Therefore, already the conversion to bytes will do something unspecified with your Turkish characters -- in practice they are replaced with '?'.
Next, attempting to interpret whichever bytes you get out of it as UTF-8 even though they're really ISO-8859-1 is then guaranteed to make a complete mess of everything that wasn't ASCII to begin with.
(The repertoire of ISO-8859-1 happens to coincide exactly with the Unicode characters that can be written as \u00XX for some XX.)
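A quick sketch of what that means in practice (ISO-8859-1's replacement byte is '?', 0x3F):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyEncode {
    public static void main(String[] args) {
        // "İletişim" written with Unicode escapes, as in the question.
        byte[] bytes = "\u0130leti\u015Fim".getBytes(StandardCharsets.ISO_8859_1);
        // Prints [63, 108, 101, 116, 105, 63, 105, 109]: İ and ş became '?' (63).
        System.out.println(Arrays.toString(bytes));
    }
}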
With encoding issues you have several things to check:
Whether your source file is in the encoding you expect it to be.
How client_encoding is set
What the database encoding is
In the case of Java, PgJDBC requires client_encoding to always be UTF-8 and will choke if you set it to something else, so that's not going to be the issue. You've shown that your database is UTF-8 too. So it seems likely that your Java sources aren't in the same encoding the Java compiler and runtime expect them to be in.
By default javac will interpret your source code in the platform default encoding. If you've saved your sources in a different encoding, weird things will happen. Save your sources either:
in the default encoding for your Windows platform;
as Unicode ("UTF-16" or "UCS-2"); or
as UTF-8 with a Byte Order Mark (BOM); note that many programs don't add a BOM for UTF-8.
Then recompile your program. If that doesn't help, you'll need to follow up with more detail, starting with what exactly "it did not work" means, output of SELECTing the data you inserted with Java using psql, etc.
You should create the database like this:
CREATE DATABASE <db name>
WITH OWNER <owner user name>
TEMPLATE template0
ENCODING 'UTF-8'
LC_COLLATE 'tr_TR.UTF-8'
LC_CTYPE 'tr_TR.UTF-8';

Why does equalsIgnoreCase() fail for the letters æ, ø, å when using UTF-8?

I am unable to print/compare the letters æøå with the upper-case letters ÆØÅ. My code runs on Mac OS X 10.6.4 in Eclipse STS 2.5, and I have set Eclipse to use UTF-8 instead of MacRoman. It seems that none of equalsIgnoreCase, toUpperCase and toLowerCase work, and I cannot print the letters correctly to the console. Any idea what I am missing?
Example:
String ae1 = "æ";
String ae2 = "Æ";
System.out.println(ae1);
System.out.println(ae2.toLowerCase());
if(ae1.equalsIgnoreCase(ae2))
System.out.println("match");
else
System.out.println("no match");
Returns:
æ
ß
no match
Well, it's not at all clear which of the following situations you're in:
Your string literals are being compiled correctly, equalsIgnoreCase is failing, and the console is failing
Your string literals are being compiled incorrectly - and once you've got garbage data, nothing else is going to work
I strongly suggest you try using the \uxxxx format to make sure you get the right input data. You could analyze your current code by printing out the value of (int) ae1.charAt(0) and seeing which Unicode character that is.
Once you've separated things out to work out exactly which stage is failing, you can adjust the code appropriately - whether that's using a Collator or some other approach.
equals() is not meant for comparing natural languages. You should be using Collator: http://java-x.blogspot.com/2006/09/javatextcollator-for-string-comparison.html
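For illustration, a sketch of a Collator-based comparison (the Norwegian locale is an assumption based on the æøå letters):
import java.text.Collator;
import java.util.Locale;

public class CollatorDemo {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(new Locale("no", "NO"));
        collator.setStrength(Collator.PRIMARY); // PRIMARY ignores case (and accent) differences
        System.out.println(collator.compare("æ", "Æ") == 0); // true
    }
}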
Your output clearly says that your source files are UTF-8, but the compiler is configured to read sources as Mac OS Roman.
Since you say you configured Eclipse to use UTF-8, perhaps your configuration is somehow wrong or incomplete.
To make sure that it's a problem with source encoding mismatch, you can replace these characters by their Unicode escapes. In this case equalsIgnoreCase() works as expected:
String ae1 = "\u00e6";
String ae2 = "\u00c6";
I guess my string literals are being compiled incorrectly because the compiler or Eclipse is not configured properly, but I have not figured out what exactly it is. Using the \uxxxx format did, however, solve my issues, so I will leave it at that for now.
If I stumble upon a solution I will post it here.
Thanks for your answers!
