delete unwanted characters from URL - java

I have this variable String var = class.getSomething that contains this url http://www.google.com§°§#[]|£%/^<> .The output that comes out is this: http://www.google.comç°§#[]|£%/^<>. How can i delete that Ã? Thanks!

You could do this, it replaces any character for empty getting your purpouse.
str = str.replace("Â", "");
With that you will replace  for nothing, getting the result you want.

Use String.replace
var = var.replace("Ã", "");

specify the charset as UTF-8 to get rid of unwanted extra chars :
String var = class.getSomething;
var = new String(var.getBytes(),"UTF-8");

Do you really want to delete only that one character or all invalid characters? Otherwise you can check each character with CharacterUtils.isAsciiPrintable(char ch). However, according to RFC 3986 even fewer character are allowed in URLs (alphanumerics and "-_.+=!*'()~,:;/?$#&%", see Characters allowed in a URL).
In any case, you have to create a new String object (like with replace in the answer by Elias MP or putting valid characters one by one into a StringBuilder and convert it to a String) as Strings are immutable in Java.

The string in var is output using utf-8, which results in the byte sequence:
c2 a7 c2 b0 c2 a7 23 5b 5d 7c c2 a3 25 2f 5e 3c 3e
This happens to be the iso-8859-1 encoding of the characters as you see them:
§ ° §#[]| £%/^<>
ç°§#[]|£%/^<>
C2 is the encoding for Â.
I'm not sure how the à was produced; it's encoding is C3.
We need the full code to learn how this happened, and a description how the character encoding for text files on your system is configured.
Modifying the variable var is useless.

Related

How to convert character encoding from Shift-JIS to UTF-8 [duplicate]

I have a String created from a byte[] array, using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).
Is there a way to convert this String back to the right encoding?
I know it's easy to do if you have access to the original byte array, but it my case it's too late because it's given by a closed source library.
As there seems to be some confusion on whether this is possible or not I think I'll need to provide an extensive example.
The question claims that the (initial) input is a byte[] that contains Windows-1252 encoded data. I'll call that byte[] ib (for "initial bytes").
For this example I'll choose the German word "Bär" (meaning bear) as the input:
byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; //verify that the character was correctly decoded.
(If your JVM doesn't support that encoding, then you can use ISO-8859-1 instead, because those three letters (and most others) are at the same position in those two encodings).
The question goes on to state that some other code (that is outside of our influence) already converted that byte[] to a String using the UTF-8 encoding (I'll call that String is for "input String"). That String is the only input that is available to achieve our goal (if ib were available, it would be trivial):
String is = new String(ib, "UTF-8");
System.out.println(is);
This obviously produces the incorrect output "B�".
The goal would be to produce ib (or the correct decoding of that byte[]) with only is available.
Now some people claim that getting the UTF-8 encoded bytes from that is will return an array with the same values as the initial array:
byte[] utf8Again = is.getBytes("UTF-8");
But that returns the UTF-8 encoding of the two characters B and � and definitely returns the wrong result when re-interpreted as Windows-1252:
System.out.println(new String(utf8Again, "Windows-1252");
This line produces the output "B�", which is totally wrong (it is also the same output that would be the result if the initial array contained the non-word "Bür" instead).
So in this case you can't undo the operation, because some information was lost.
There are in fact cases where such mis-encodings can be undone. It's more likely to work, when all possible (or at least occuring) byte sequences are valid in that encoding. Since UTF-8 has several byte sequences that are simply not valid values, you will have problems.
I tried this and it worked for some reason
Code to repair encoding problem (it doesn't work perfectly, which we will see shortly):
final Charset fromCharset = Charset.forName("windows-1252");
final Charset toCharset = Charset.forName("UTF-8");
String fixed = new String(input.getBytes(fromCharset), toCharset);
System.out.println(input);
System.out.println(fixed);
The results are:
input: …Und ich beweg mich (aber heut nur langsam)
fixed: …Und ich beweg mich (aber heut nur langsam)
Here's another example:
input: Waun da wuan ned wa (feat. Wolfgang Kühn)
fixed: Waun da wuan ned wa (feat. Wolfgang Kühn)
Here's what is happening and why the trick above seems to work:
The original file was a UTF-8 encoded text file (comma delimited)
That file was imported with Excel BUT the user mistakenly entered Windows 1252 for the encoding (which was probably the default encoding on his or her computer)
The user thought the import was successful because all of the characters in the ASCII range looked okay.
Now, when we try to "reverse" the process, here is what happens:
// we start with this garbage, two characters we don't want!
String input = "ü";
final Charset cp1252 = Charset.forName("windows-1252");
final Charset utf8 = Charset.forName("UTF-8");
// lets convert it to bytes in windows-1252:
// this gives you 2 bytes: c3 bc
// "Ã" ==> c3
// "¼" ==> bc
bytes[] windows1252Bytes = input.getBytes(cp1252);
// but in utf-8, c3 bc is "ü"
String fixed = new String(windows1252Bytes, utf8);
System.out.println(input);
System.out.println(fixed);
The encoding fixing code above kind of works but fails for the following characters:
(Assuming the only characters used 1 byte characters from Windows 1252):
char utf-8 bytes | string decoded as cp1252 --> as cp1252 bytes
” e2 80 9d | â€� e2 80 3f
Á c3 81 | Ã� c3 3f
Í c3 8d | Ã� c3 3f
Ï c3 8f | Ã� c3 3f
Рc3 90 | � c3 3f
Ý c3 9d | Ã� c3 3f
It does work for some of the characters, e.g. these:
Þ c3 9e | Þ c3 9e Þ
ß c3 9f | ß c3 9f ß
à c3 a0 | à c3 a0 à
á c3 a1 | á c3 a1 á
â c3 a2 | â c3 a2 â
ã c3 a3 | ã c3 a3 ã
ä c3 a4 | ä c3 a4 ä
å c3 a5 | Ã¥ c3 a5 å
æ c3 a6 | æ c3 a6 æ
ç c3 a7 | ç c3 a7 ç
NOTE - I originally thought this was relevant to your question (and as I was working on the same thing myself I figured I'd share what I've learned), but it seems my problem was slightly different. Maybe this will help someone else.
What you want to do is impossible. Once you have a Java String, the information about the byte array is lost. You may have luck doing a "manual conversion". Create a list of all windows-1252 characters and their mapping to UTF-8. Then iterate over all characters in the string to convert them to the right encoding.
Edit:
As a commenter said this won't work. When you convert a Windows-1252 byte array as it if was UTF-8 you are bound to get encoding exceptions. (See here and here).
You can use this tutorial
The charset you need should be defined in rt.jar (according to this)

Find a word in a line based on next word

I have a file and here is a portion of of the file. The common word in all lines is PIC here and I am able to find out the index of PIC. I am trying to extract the description for each line. Here how can I extract the word before the word PIC?
15 EXTR-SITE PIC X.
05 EXTR-DBA PIC X.
TE0305* 05 EXTR-BRANCH PIC X(05).
TE0305* 05 EXTR-NUMBER PIC X(06).
TE0305 05 FILLER PIC X(11).
CW0104 10 EXTR-TEXT6 PIC X(67).
CW0104 10 EXTR-TEXT7 PIC X(67).
CW0104* 05 FILLER PIC X(567).
I have to get result like below
EXTR-SITE
EXTR-DBA
EXTR-NUMBER
-------
FILLER
Is there any expression I can use to find the word before 'PIC'?
Here is my code to get lines that contain 'PIC':
int wordStartIndex = line.indexOf("PIC");
int wordEndIndex = line.indexOf(".");
if ((wordStartIndex > -1) && (wordEndIndex >= wordStartIndex)) {
System.out.println(line); }
15 EXTR-SITE PIC X.
05 EXTR-DBA PIC X.
TE0305* 05 EXTR-BRANCH PIC X(05).
TE0305* 05 EXTR-NUMBER PIC X(06).
TE0305 05 FILLER PIC X(11).
CW0104 10 EXTR-TEXT6 PIC X(67).
CW0104 10 EXTR-TEXT7 PIC X(67).
CW0104* 05 FILLER PIC X(567).
I think you need to find out more about COBOL before you approach this task.
Columns 1-6 can contain a sequence number, can be blank, or can contain anything. If you are attempting to parse COBOL code you need to ignore columns 1-6.
Column 7 is called the Indicator area. It may be blank, or contain an * which indicates a comment, or a -, which indicates the line is a continuation of the previous non-blank/non-comment line, or contain a D which indicates it is a debugging line.
Columns 73-80 may contain another sequence number, or blank, or anything, and must be ignored.
If your COBOL source was "free format", things would be a bit different, but it is not.
There is no sense in extracting data from comment lines, so your expected output is not valid. It is also unclear where you get the line of dashes in your expected output.
If you are trying to parse COBOL source, you must have valid COBOL source. This is not valid:
TE0305 05 FILLER PIC X(11).
CW0104 10 EXTR-TEXT6 PIC X(67).
CW0104 10 EXTR-TEXT7 PIC X(67).
A level-number (the 05) is a group-item if it is followed by higher level-numbers (the two 10s). A group-item cannot have a PICture.
PIC itself can also be written in full, as PICTURE.
PIC can quite easily appear in an identifier/data-name (EPIC-CODE). As could PICTURE, in theory.
PIC and PICTURE could appear in a comment line, even if not a commented line of code.
The method you want to use to find the "description" (which is the identifier, or data-name) is flawed.
01 the-record.
05 fixed-part-of-record.
10 an-individual-item PIC X.
10 another-item COMP-1.
10 and-another COMP-3 PIC 9(3).
10 PIC X.
05 variable-part-of-record.
10 entry-name OCCURS 10 TIMES.
15 entry-name-client-first-name
PIC X(30).
15 entry-name-client-surname
PIC X(30).
That is just a short example, not to be considered all-encompassing.
From that, your method would retrieve
an-individual-item
COMP-3
and two lines of "whatever happens when PIC is the first thing on line"
To save this becoming a chameleon question, you need to ask a new question (or sort it out yourself) with a different method.
Depending on the source of the COBOL source, there are better ways to deal with this. If the source is an IBM Mainframe COBOL, then the source for your source should either be a compile listing or the SYSADATA from the compile.
From either of those, you'd pick up the identifier/data-name at a specific location under a specific condition. No parsing to do at all.
If you cannot get that, then I'd suggest you look for the level-number, and find the first thing after that. You will still have some work to do.
Level-numbers can be one or two digits, in the range 1-49, plus 66, 77, 88. Some compilers also have 78. If your extract is only "records" (likely) you won't see 77 or 78. You'll likely not see 66 (only seen it used once) and quite probably will see 88s, which you may or may not want to include in your output (depending on what you need it for).
1.
01.
01 FILLER.
01 data-name-name-1.
01 data-name-name-2 PIC X(80).
5.
05.
05 FILLER.
05 FILLER PIC X.
05 data-name-name-3.
05 data-name-name-4 PIC X.
The use of a single-digit for a level-number and not spelling FILLER explicitly are fairly "new" (from the 1985 Standard) and it is quite possible you don't have any of those. But you might.
The output from the above should be:
FILLER
FILLER
FILLER
data-name-name-1
data-name-name-2
FILLER
FILLER
FILLER
FILLER
data-name-name-3
data-name-name-4
I have no idea what you'd want to do with that output. With no context, it doesn't have a lot of meaning.
It is possible that your selected method would work with your actual data (assuming you pickled your sample, and that what you get is valid code).
However, it would still be simpler to say "if the first word on a line is one- or two-digit numeric, if there is a second word, that's what we want, else use FILLER". Noting, of course, the previous comments about what you should ignore.
Unless your source contains 88-levels. Because it would be quite common for a range of values to require a second line, and if the values happen to be numeric, and one or two digits, then that won't work either.
So, identify the source of your source. If it is an IBM Mainframe, attempt to get output from the compile. Then your task is really easy, and 100% accurate.
If you can't get that, then understand your data thoroughly. If you have really simple structures such that your method works, doing it from the level-number will still be easier.
If you need to come back to this, please ask a new question. Otherwise you're hanging out to dry the people who have already spent their time voluntarily answering your existing question.
If you are not committed to writing a Cobol parser yourself, a couple of options include:
Use the Cobol Compiler to process the Cobol copybook. This will create a listing of the Cobol-Copybook in a format that is easier to parse. I have worked at companies that converted all there Cobol-Copybooks to the equivalent easytrieve copybooks automatically by compiling the Cobol-Copybook in a Hello-World type program and processing the output.
Products like File-Aid have a Cobol parsers that produce an easily digested version of the Cobol Copybook.
The java project cb2xml will convert a Cobol-Copybook to Xml. The project provides some examples of processing the Xml with Jaxb.
To parse a Cobol-Copybook into a Java list of items using cb2xml (taken from Demo2.java):
JAXBContext jc = JAXBContext.newInstance(Condition.class, Copybook.class, Item.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
Document doc = Cb2Xml2.convertToXMLDOM(
new File(Code.getFullName("BitOfEverything.cbl").getFile()),
false,
Cb2xmlConstants.USE_STANDARD_COLUMNS);
JAXBElement<Copybook> copybook = unmarshaller.unmarshal(doc, Copybook.class);
The program Demo2.java will then print the contents of a cobol copybook out:
List<Item> items = copybook.getValue().getItem();
for (Item item : items) {
Code.printItem(" ", item);
}
And to print a Cobol-Item Code.java:
public static void printItem(String indent, Item item) {
char[] nc = new char[Math.max(1, 50 - indent.length()
- item.getName().length())];
String picture = item.getPicture();
Arrays.fill(nc, ' ');
if (picture == null) {
picture = "";
}
System.out.println(indent + item.getLevel() + " " + item.getName()
+ new String(nc) + item.getPosition()
+ " " + item.getStorageLength() + "\t" + picture);
List<Item> childItems = item.getItem();
for (Item child : childItems) {
printItem(indent + " ", child);
}
}
The output from Demo2 is like (gives you the level, field name, start, length and picture):
01 CompFields 1 5099
03 NumA 1 25 --,---,---,---,---,--9.99
03 NumB 26 3 9V99
03 NumC 29 3 999
03 text 32 20 x(20)
03 NumD 52 3 VPPP999
03 NumE 55 3 999PPP
03 float 58 4
03 double 62 8
03 filler 70 23
05 RBI-REPETITIVE-AREA 70 13
10 RBI-REPEAT 70 13
15 RBI-NUMBER-S96SLS 70 7 S9(06)
15 RBI-NUMBER-S96DISP 77 6 S9(06)
05 SFIELD-SEP 83 10 S9(7)V99
Another cb2xml example is DemoCobolJTreeTable.java which displays a COBOL copybook in a Tree table:
You can try regex like this :
public static void main(String[] args) {
String s = "15 EXTR-SITE PIC X.";
System.out.println(s.replaceAll("(.*?\\s+)+(.*?)(?=\\s+PIC).*", "$1"));
}
O/P:
EXTR-SITE
Explanation :
(.*?\\s+)+(.*?)(?=\\s+PIC).*", "$1") :
(.*?\\s+)+ --> Find one or more groups of "anything" which is followed by a space.
(.*?)(?=\\s+PIC) -->find a group of "any set of characters" which are followed by a space and the word "PIC".
.* --> Select everything after PIC.
$1 --> the contents of the actual String with the first captured group i.e, data between `()`.
PS : This works with all your current inputs :P
//let 'lines' be an array of all your lines
//with one complete line as string per element
for(String line : lines){
String[] splitted = line.split(" ");
for(int i = 0; i < splitted.length; i++){
if(splitted[i].equals("PIC") && i > 0) System.out.println(splitted[i-1]);
}
}
Please note that I didn't test this code yet (but will in a few minutes). However the general approach shold be clear now.
Try to use String.split("\\s+"). This method splits the original string into an array of Strings (String[]). Then, using Arrays.asList(...) you can transform your array into a List, so you can search for a particular object using indexOf.
Here is an extract of a possibile solution:
String words = "TE0305* 05 EXTR-BRANCH PIC X(05).";
List<String> list = Arrays.asList(words.split("\\s+"));
int index = list.indexOf("PIC");
// Prints EXTR-BRANCH
System.out.println(index > 0 ? list.get(index - 1) : ""); // Added a guard
In my honest opinion, this code lets Java working for you, and not the opposite. It is concise, readable and then more maintainable.

Extracting data from a text file - repeated values

79 0009!017009!0479%0009!0479 0009!0469%0009!0469
0009!0459%0009!0459'009 0009!0459%0009!0449 0009!0449%0009!0449
0009!0439%0009!0439 0009!0429%0009!0429'009 0009!0429%0009!0419
0009!0419%0009!0409 000'009!0399 0009!0389%0009!0389'009
0009!0379%0009!0369 0009!0349%0009!0349 0009!0339%0009!0339
0009!0339%0009!0329'009 0009!0329%0009!0329 0009!032
In this data, I'm supposed to extract the number 47, 46 , 45 , 44 and so on. I´m supposed to avoid the rest. The numbers always follow this flow - 9!0 no 9%
for example: 9!0 42 9%
Which language should I go about to solve this and which function might help me?
Is there any function that can position a special character and copy the next two or three elements?
Ex: 9!0 42 9% and ' 009
look out for ! and then copy 42 from there and look out for ' that refers to another value (009). It's like two different regex to be used.
You can use whatever language you want, or even a unix command line utility like sed, awk, or grep. The regex should be something like this - you want to match 9!0 followed by digits followed by 0%. Use this regex: 9!0(\d+)0% (or if the numbers are all two digits, 9!0(\d{2})0%).
The other answers are fine, my regex solution is simply "9!.(\d\d)"
And here's a full solution in powershell, which can be easily correlated to other .net langs
$t="79 0009!017009!0479%0009!0479 0009!0469%0009!0469 0009!0459%0009!0459'009 0009!0459%0009!0449 0009!0449%0009!0449 0009!0439%0009!0439 0009!0429%0009!0429'009 0009!0429%0009!0419 0009!0419%0009!0409 000'009!0399 0009!0389%0009!0389'009 0009!0379%0009!0369 0009!0349%0009!0349 0009!0339%0009!0339 0009!0339%0009!0329'009 0009!0329%0009!0329 0009!032"
$p="9!.(\d\d)"
$ms=[regex]::match($t,$p)
while ($ms.Success) {write-host $ms.groups[1].value;$ms=$ms.NextMatch()}
This is perl:
#result = $subject =~ m/(?<=9!0)\d+(?=9%)/g;
It will give you an array of all your numbers. You didn't provide a language so I don't know if this is suitable for you or not.
Pattern regex = Pattern.compile("(?<=9!0)\\d+(?=9%)");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
// match start: regexMatcher.start()
// match end: regexMatcher.end()
}

Java cannot retrieve Unicode (Lithuanian) letters from Access via JDBC-ODBC

i have DB where some names are written with Lithuanian letters, but when I try to get them using java it ignores Lithuanian letters
DbConnection();
zadanie=connect.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE,ResultSet.CONCUR_UPDATABLE);
sql="SELECT * FROM Clients;";
dane=zadanie.executeQuery(sql);
String kas="Imonė";
while(dane.next())
{
String var=dane.getString("Pavadinimas");
if (var!= null) {var =var.trim();}
String rus =dane.getString("Rusys");
System.out.println(kas+" "+rus);
}
void DbConnection() throws SQLException
{
String baza="jdbc:odbc:DatabaseDC";
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
}catch(Exception e){System.out.println("Connection error");}
connect=DriverManager.getConnection(baza);
}
in DB type of field is TEXT, size 20, don't use any additional letter decoding or something like this.
it gives me " Imonė Imone " despite that in DB is written "Imonė" which equals rus.
Now that the JDBC-ODBC Bridge has been removed from Java 8 this particular question will increasingly become just an item of historical interest, but for the record:
The JDBC-ODBC Bridge has never worked correctly with the Access ODBC Drivers ("Jet" and "ACE") for Unicode characters above code point U+00FF. That is because Access stores such characters as Unicode but it does not use UTF-8 encoding. Instead, it uses a "compressed" variation of UTF-16LE where characters with code points U+00FF and below are stored as a single byte, while characters above U+00FF are stored as a null byte followed by their UTF-16LE byte pair(s).
If the string 'Imonė' is stored within the Access database so that it appears properly in Access itself
then it is stored as
I m o n ė
-- -- -- -- --------
49 6D 6F 6E 00 17 01
('ė' is U+0117).
The JDBC-ODBC Bridge does not understand what it receives from the Access ODBC driver for that final character, so it just returns
Imon?
On the other hand, if we try to store the string in the Access database with UTF-8 encoding, as would happen if the JDBC-ODBC Bridge attempted to insert the string itself
Statement s = con.createStatement();
s.executeUpdate("UPDATE vocabulary SET word='Imonė' WHERE ID=5");
the string would be UTF-8 encoded as
I m o n ė
-- -- -- -- -----
49 6D 6F 6E C4 97
and then the Access ODBC Driver will store it in the database as
I m o n Ä —
-- -- -- -- -- ---------
49 6D 6F 6E C4 00 14 20
C4 is 'Ä' in Windows-1252 which is U+00C4 so it is stored as just C4
97 is "em dash" in Windows-1252 which is U+2014 so it is stored as 00 14 20
Now the JDBC-ODBC Bridge can retrieve it okay (since the Access ODBC Driver "un-mangles" the character back to C4 97 on the way out), but if we open the database in Access we see
ImonÄ—
The JDBC-ODBC Bridge has never and will never be able to provide full native Unicode support for Access databases. Adding various properties to the JDBC connection will not solve the problem.
For full Unicode character support of Access databases without ODBC, consider using UCanAccess instead. (More details available in another question here.)
As you're using the JDBC-ODBC bridge, you can specify a charset in the connection details.
Try this:
Properties prop = new java.util.Properties();
prop.put("charSet", "UTF-8");
String baza="jdbc:odbc:DatabaseDC";
connect=DriverManager.getConnection(baza, prop);
Try to use this "Windows-1257" instead of UTF-8, this is for Baltic region.
java.util.Properties prop = new java.util.Properties();
prop.put("charSet", "Windows-1257");

"Fix" String encoding in Java

I have a String created from a byte[] array, using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).
Is there a way to convert this String back to the right encoding?
I know it's easy to do if you have access to the original byte array, but it my case it's too late because it's given by a closed source library.
As there seems to be some confusion on whether this is possible or not I think I'll need to provide an extensive example.
The question claims that the (initial) input is a byte[] that contains Windows-1252 encoded data. I'll call that byte[] ib (for "initial bytes").
For this example I'll choose the German word "Bär" (meaning bear) as the input:
byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; //verify that the character was correctly decoded.
(If your JVM doesn't support that encoding, then you can use ISO-8859-1 instead, because those three letters (and most others) are at the same position in those two encodings).
The question goes on to state that some other code (that is outside of our influence) already converted that byte[] to a String using the UTF-8 encoding (I'll call that String is for "input String"). That String is the only input that is available to achieve our goal (if ib were available, it would be trivial):
String is = new String(ib, "UTF-8");
System.out.println(is);
This obviously produces the incorrect output "B�".
The goal would be to produce ib (or the correct decoding of that byte[]) with only is available.
Now some people claim that getting the UTF-8 encoded bytes from that is will return an array with the same values as the initial array:
byte[] utf8Again = is.getBytes("UTF-8");
But that returns the UTF-8 encoding of the two characters B and � and definitely returns the wrong result when re-interpreted as Windows-1252:
System.out.println(new String(utf8Again, "Windows-1252");
This line produces the output "B�", which is totally wrong (it is also the same output that would be the result if the initial array contained the non-word "Bür" instead).
So in this case you can't undo the operation, because some information was lost.
There are in fact cases where such mis-encodings can be undone. It's more likely to work, when all possible (or at least occuring) byte sequences are valid in that encoding. Since UTF-8 has several byte sequences that are simply not valid values, you will have problems.
I tried this and it worked for some reason
Code to repair encoding problem (it doesn't work perfectly, which we will see shortly):
final Charset fromCharset = Charset.forName("windows-1252");
final Charset toCharset = Charset.forName("UTF-8");
String fixed = new String(input.getBytes(fromCharset), toCharset);
System.out.println(input);
System.out.println(fixed);
The results are:
input: …Und ich beweg mich (aber heut nur langsam)
fixed: …Und ich beweg mich (aber heut nur langsam)
Here's another example:
input: Waun da wuan ned wa (feat. Wolfgang Kühn)
fixed: Waun da wuan ned wa (feat. Wolfgang Kühn)
Here's what is happening and why the trick above seems to work:
The original file was a UTF-8 encoded text file (comma delimited)
That file was imported with Excel BUT the user mistakenly entered Windows 1252 for the encoding (which was probably the default encoding on his or her computer)
The user thought the import was successful because all of the characters in the ASCII range looked okay.
Now, when we try to "reverse" the process, here is what happens:
// we start with this garbage, two characters we don't want!
String input = "ü";
final Charset cp1252 = Charset.forName("windows-1252");
final Charset utf8 = Charset.forName("UTF-8");
// lets convert it to bytes in windows-1252:
// this gives you 2 bytes: c3 bc
// "Ã" ==> c3
// "¼" ==> bc
bytes[] windows1252Bytes = input.getBytes(cp1252);
// but in utf-8, c3 bc is "ü"
String fixed = new String(windows1252Bytes, utf8);
System.out.println(input);
System.out.println(fixed);
The encoding fixing code above kind of works but fails for the following characters:
(Assuming the only characters used 1 byte characters from Windows 1252):
char utf-8 bytes | string decoded as cp1252 --> as cp1252 bytes
” e2 80 9d | â€� e2 80 3f
Á c3 81 | Ã� c3 3f
Í c3 8d | Ã� c3 3f
Ï c3 8f | Ã� c3 3f
Рc3 90 | � c3 3f
Ý c3 9d | Ã� c3 3f
It does work for some of the characters, e.g. these:
Þ c3 9e | Þ c3 9e Þ
ß c3 9f | ß c3 9f ß
à c3 a0 | à c3 a0 à
á c3 a1 | á c3 a1 á
â c3 a2 | â c3 a2 â
ã c3 a3 | ã c3 a3 ã
ä c3 a4 | ä c3 a4 ä
å c3 a5 | Ã¥ c3 a5 å
æ c3 a6 | æ c3 a6 æ
ç c3 a7 | ç c3 a7 ç
NOTE - I originally thought this was relevant to your question (and as I was working on the same thing myself I figured I'd share what I've learned), but it seems my problem was slightly different. Maybe this will help someone else.
What you want to do is impossible. Once you have a Java String, the information about the byte array is lost. You may have luck doing a "manual conversion". Create a list of all windows-1252 characters and their mapping to UTF-8. Then iterate over all characters in the string to convert them to the right encoding.
Edit:
As a commenter said this won't work. When you convert a Windows-1252 byte array as it if was UTF-8 you are bound to get encoding exceptions. (See here and here).
You can use this tutorial
The charset you need should be defined in rt.jar (according to this)

Categories