Git finds changes because of encoding difference - java

I cloned a Java project from a git repository. After opening the project in Eclipse and making some changes, I noticed that git was reporting changes in parts of the code that I hadn't touched. The code looked unchanged in Eclipse, so I opened one of the files with Notepad++, where it looks like below if the chosen encoding is UTF-8 (default).
If I change the Notepad++ encoding to ANSI, however, I see the same characters as in the diff.
If I overwrite the problematic character in Notepad++ using the ANSI encoding, git doesn't see any changes.
I understand that the same two bytes that represent Ã and « in ANSI (0xC3 0xAB) probably represent the single character ë in UTF-8, but I don't understand how this problem arose from just cloning the repository and opening the code in Eclipse.
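The byte-level picture described above can be checked with a few lines of Java. This is only a sketch: it assumes that "ANSI" in Notepad++ means windows-1252 (the usual case on Western European Windows installs), and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Assumption: the "ANSI" code page is windows-1252.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Encodes the text as UTF-8 and then decodes those bytes as windows-1252,
    // which is what a tool does when it guesses the wrong encoding.
    static String readUtf8AsCp1252(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, CP1252);
    }

    public static void main(String[] args) {
        // U+00EB is the e with diaeresis; the escape keeps this file pure ASCII.
        String misread = readUtf8AsCp1252("\u00EB");
        for (char c : misread.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // prints U+00C3, then U+00AB
        }
    }
}
```

If any tool in the chain (IDE, editor, git filter) decodes with one charset and saves with another, the bytes on disk change even though the text looks the same in one of the editors, and git reports a diff.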

Related

IntelliJ keeps switching to UTF8 (I want to set CP-1252)

I have some projects which are encoded with Windows-1252/CP-1252 and I can't change the encoding. The problem is that, no matter what I do, IntelliJ keeps trying to read these files as UTF-8 unless I manually put every single file in the encoding list.
That takes a lot of time and effort, it's error-prone, and it's not a real solution. I have set the entire project and IDE encoding to CP-1252, but it keeps trying to read files as UTF-8 anyway.
I don't know what causes that. We are using Subversion to commit files and Maven to compile (which uses UTF-8 to read files except for the super POM, which uses CP-1252).
Any idea how to solve the problem? I looked at other posts but have found no real solution yet. I'm currently using the latest IntelliJ version (2017.1.2).
I actually found out what the problem was. The Maven project encoding was overriding the IntelliJ configuration. I had tried to edit the source encoding property before, but it didn't work because I misspelled Cp1252. Now it seems to work.

IntelliJ 13 Ult. Changes encoding randomly

I have many modules that have different encodings (UTF-8 and ISO-8859-7).
I set their respective encodings in the File Encodings section of the settings,
and my Maven builds state the encoding of each project as well. I also have these files in git, and I am not alone on these projects, i.e. other people commit to them as well.
What has been happening for the last month is this:
Randomly, on some files, the encoding changes to UTF-8. When I try to change it back to ISO-8859-7 and select the option "Reload", nothing happens: the encoding does not change.
What I have to do is the following steps:
Select ISO-8859-7 as encoding
Press Convert in the options
then when the file is marked as changed in git I revert it
only then does the encoding go back to normal
My IntelliJ version:
My question is this:
Is this a bug or am I doing something wrong/missing something that I should have configured?
UPDATE:
I want to add some more info I have found useful
The problem appears every time I open IntelliJ, on files whose encoding I had fixed the day before.
(I don't know if this is related, but it happens as well.) When I convert the file to the ISO encoding and then revert the file so that I can see the text, the git branch becomes detached.
Also, I am the only one working on this project, so no one else changes the files committed in git.
I was able to reproduce this by committing a project with mixed files to Git without IntelliJ IDEA project files, which is what should be done usually because people use different IDEs.
Then if I clone the project to another directory and create a new IDEA project, the encoding settings are different, but even if I correct them, the first time a Greek file is loaded IDEA thinks it is UTF-8 because "Autodetect UTF-encoded files" is active. If the setting is inactive it is the other way around: Greek files will be displayed correctly but UTF-8 ones are shown as Greek.
Whatever went wrong, I was easily able to switch the encoding and select "reload" from the pop-up dialogue.
Then I committed the IDEA files (.idea subdirectory plus MyProject.iml file) to the same repository and cloned again. This time the files were both displayed correctly. So this is my suggested workaround. Hopefully it works for all IDEA users in your project. I have no idea what happens if e.g. an Eclipse user adds a new file to the repo which is still unknown to IDEA when you pull/fetch next time. Probably you will have to assign the right encoding again.
So how does IDEA remember which files or directories have which encoding? Look at the file .idea/encodings.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="Encoding" useUTFGuessing="true" native2AsciiForPropertiesFiles="false" defaultCharsetForPropertiesFiles="ISO-8859-7">
    <file url="file://$PROJECT_DIR$/src/utf8/Unicode.java" charset="UTF-8" />
    <file url="PROJECT" charset="ISO-8859-7" />
  </component>
</project>
There you can see that I have ISO-8859-7 set as the default for normal and properties files and UTF guessing switched on, just like you in your screenshots. You also see that the project file $PROJECT_DIR$/src/utf8/Unicode.java has a UTF-8 encoding. What you do not see listed there are files with Greek encoding, because that is the project default anyway.
Bottom line: If you at least commit .idea/encodings.xml you can preserve the encoding information. This is not a bug in IDEA in my opinion but a normal thing. How can IDEA know the encoding if you do not have a saved project state? Encodings are not stored in Git or on the file system; they are meta information. This is probably why XML files mention their own encoding right at the beginning on the very first line (as you can also see above in my snippet). Java files do not have such an encoding tag. So for mixed encoding projects you just need to be careful. ;-)
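To illustrate why a tool can only guess when no encoding is recorded, here is a minimal Java sketch of one common heuristic: check whether the bytes form well-formed UTF-8. This is not IDEA's actual detection code; Utf8Guess and isValidUtf8 are hypothetical names used only for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Guess {
    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] greekIso = {(byte) 0xE1, (byte) 0xE2};  // alpha, beta in ISO-8859-7
        byte[] greekUtf8 = {(byte) 0xCE, (byte) 0xB1}; // alpha in UTF-8
        System.out.println(isValidUtf8(greekIso));     // prints false
        System.out.println(isValidUtf8(greekUtf8));    // prints true
    }
}
```

Note the limits of the guess: the two bytes 0xCE 0xB1 are well-formed UTF-8, but they are also a legal ISO-8859-7 sequence (a capital xi followed by a plus-minus sign), and a pure ASCII file is valid in both encodings. That ambiguity is why the stored per-file setting matters.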
Update: My test repo is on GitHub if you like to play around.

Linux - Java/File encoding

I've searched around the web a while now and haven't found anything giving me a proper answer.
I've got a Linux server running Debian and a Bukkit server. I ran my server on Windows before, and my files seemed to work fine with UTF-8 encoding. I uploaded my files via WinSCP, and now they seem to be ASCII or something else: every special character, such as umlauts, has changed to placeholders in the files and to question marks in-game.
I've tried to change the encoding of a file (it would be hard to do this for every file, especially if I need to do that every time I upload a new one), but the characters only changed to a single question mark instead of the placeholders.
For Jenkins I needed to change the encoding via encoding=... in the javac execution in my build.xml, but I don't know any flag to change the encoding for the java command.
I also read that it should be possible to change the encoding for the whole Java installation, but the commands I tried didn't work at all.
I would be happy to get some tips on how to fix this, or in general on how to avoid converting every file I upload...
Thank you very much :)
~Julian
You can try
java -Dfile.encoding=UTF-8 -jar yourserver.jar
(replace yourserver.jar with the name of your jar) to run a Java program with a specific default encoding, no matter what default encoding the current system uses.
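If you control the code that reads and writes the files, another option is to name the charset explicitly instead of relying on the JVM default. The following is a minimal sketch with a hypothetical class name; it assumes the files are meant to be UTF-8 and says nothing about how Bukkit itself loads its files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;

class ExplicitCharsetIo {
    // Writes one line as UTF-8 to a temporary file and reads it back as UTF-8.
    // Because the charset is passed explicitly, the result does not depend on
    // the file.encoding system property or the platform locale.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(file, Collections.singletonList(text), StandardCharsets.UTF_8);
                return Files.readAllLines(file, StandardCharsets.UTF_8).get(0);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A German greeting with u-umlaut and sharp s, written with escapes.
        String greeting = "Gr\u00FC\u00DFe";
        System.out.println(greeting.equals(roundTrip(greeting))); // prints true
    }
}
```

With explicit charsets the same jar behaves identically on the Windows machine and on the Debian server, so uploaded files do not need to be converted.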
If you intend to change all files in a project to a specific encoding in Eclipse,
right-click on your project in the Project Explorer -> Properties (or Alt+Enter) -> Resource. On the right you can see Text File Encoding, where you can choose UTF-8 as needed.
Remember to check all your packages (right-click and look at the Text File Encoding part) to make sure they all inherit the encoding from the container.
Hope this helps!

Byte Order Mark generating a file using Mono in Ubuntu

My .NET utility AjGenesis is a code generation tool. The compiled binaries run without glitches under Ubuntu 10.x and Mono. But I have a problem: when generating a Java text file (a normal text file for my tool), it writes a Byte Order Mark at the beginning of each file. I'm using System.Text.Encoding.Default: on Windows all is OK, but on Ubuntu the Byte Order Mark is three bytes, indicating UTF-8, I guess.
This difference is a problem: when I want to compile the generated .java files using ant or javac, the BOMs cause errors. So:
What encoding should I use on Ubuntu/Mono so that the generated files can be processed by javac?
I tried javac -encoding UTF8 without success. Any clues? My guess is that this option does not skip BOMs.
I tried System.Text.Encoding.ASCII, but my generated files contain non-ASCII characters (Spanish accented letters). If I change the encoding, the BOMs are added and javac refuses the files. Any suggestions?
TIA
Don't use Encoding.Default. Why make your output platform specific? Use UTF-8 - and if you have to use UTF-8 without a BOM, you can do that with:
Encoding utf8 = new UTF8Encoding(false);
To be honest though, I'm surprised javac fails. You say you've tried it "without success" - what was the result?
Try instantiating System.Text.UTF8Encoding and passing a constructor argument that tells it not to emit a BOM. You can read about this here:
http://msdn.microsoft.com/en-us/library/s064f8w2.aspx
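If files with a BOM have already been generated, a possible workaround is to strip the three BOM bytes before handing the files to javac. The Java sketch below is not part of AjGenesis or javac; BomStripper and stripUtf8Bom are hypothetical names, and the sketch assumes the file content has already been read into a byte array.

```java
import java.util.Arrays;

class BomStripper {
    // Returns the content without a leading UTF-8 byte order mark (EF BB BF).
    // Content that does not start with a BOM is returned unchanged.
    static byte[] stripUtf8Bom(byte[] content) {
        boolean hasBom = content.length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(content, 3, content.length) : content;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'c', 'l', 'a', 's', 's'};
        System.out.println(stripUtf8Bom(withBom).length); // prints 5
    }
}
```

Generating the files without a BOM in the first place, as suggested above, remains the cleaner fix; stripping is only useful for output that cannot easily be regenerated.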

Character Encoding in my java classes in Eclipse is messed up. How to fix it?

I have an Eclipse project that was working OK.
One day I had to format my machine, so I copied my workspace to a backup. After installing Eclipse again, I imported my projects from the backed-up workspace.
What happened is that it corrupted all the strings that contain special characters,
like.. é, são, etc.. to É, são...
Is there a way to refactor it back to normal?
I tried changing the character encoding in Eclipse, but it doesn't update the class files.
You need to reconfigure the workspace encoding. Go to Window > Preferences, enter the filter text "encoding" and set it to UTF-8 wherever applicable, especially the workspace's text file encoding.
Is there a way to refactor it back to normal?
Did you try closing an individual file, right-clicking it to open properties and setting its encoding manually?
like.. é, são, etc.. to É, são...
Are you sure it wasn't É (U+00C9) that was becoming É (U+00C3 U+2030)?
That would suggest that files that were being interpreted as UTF-8 before are now being interpreted as something else (probably windows-1252).
Many Java compilation problems can be fixed by sticking to the subset of values that appear in the US-ASCII character encoding and using Unicode escape sequences for everything else (this assumes you aren't using UTF-16 or UTF-32 or something).
The code String foo = "É"; would become String foo = "\u00C9";. This improves source code portability at the expense of readability.
You can use the JDK native2ascii tool to perform the conversion:
native2ascii -encoding UTF-8 Foo.java
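The native2ascii tool was removed from the JDK in Java 9, so on newer JDKs the conversion has to be done another way. The following is a minimal sketch of the same idea with hypothetical names; it escapes every UTF-16 code unit outside US-ASCII and makes no attempt to match the exact output format of the original tool.

```java
class UnicodeEscaper {
    // Replaces every char outside US-ASCII with a backslash-u escape of its
    // UTF-16 code unit, so the result can be stored in a pure ASCII file.
    static String escapeNonAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input contains a real capital E with acute accent (U+00C9).
        String line = "String foo = \"\u00C9\";";
        // The output contains the six ASCII characters of the escape instead.
        System.out.println(escapeNonAscii(line));
    }
}
```

Characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair), which is the form Java source files expect.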
This is probably a stupid suggestion, but I figured I'd mention it anyways. Are you opening your file.class or your file.java? Because the *.class files, if I recall correctly, are binary files which explains why you're seeing those weird characters. The *.java files are the plain-text source files.
I figure you knew that, but the wording made me feel otherwise, so I figure I'd mention it.
How do the actual .java files on disk look? Are they damaged, or is it just that Eclipse can't display them properly? If the files look good on disk, setting the encoding property as jjnguy suggests should do the trick.
If the files are damaged on disk, maybe iconv can "undamage" them?
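If the damage is the classic "UTF-8 bytes decoded as windows-1252 and saved again" pattern, the same kind of repair can be sketched in Java by reversing the two steps. This is an illustration under that assumption only; MojibakeRepair and repair are hypothetical names, and the approach will not help if the files were damaged in some other way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Assumption: the wrong decoding step used windows-1252.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Re-encodes the damaged text as windows-1252 to recover the original
    // bytes, then decodes those bytes as UTF-8.
    static String repair(String damaged) {
        return new String(damaged.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A-tilde followed by a per-mille sign is how U+00C9 shows up
        // after the wrong decoding.
        String damaged = "\u00C3\u2030";
        String repaired = repair(damaged);
        System.out.printf("U+%04X%n", (int) repaired.charAt(0)); // prints U+00C9
    }
}
```

The round trip may lose characters whose UTF-8 encoding contains a byte that windows-1252 leaves undefined, so it is worth trying on a copy of the files and reviewing the diff before overwriting anything.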
To avoid such problems in the future, I actually suggest keeping the Java files in plain ASCII, using \uNNNN escapes for non-ASCII characters in strings etc. (e.g. \u00E4 is ä / ä / ä / an a with two dots above).
