Java jar execution discrepancies between machines

I wrote a bit of code that verifies the contents of a comma delimited file by checking each field against some regular expression - the particular regex that is causing me grief is a very basic date regex (\\d{2}/\\d{2}/\\d{2}). If the data in that field doesn't match, it's supposed to write out to a separate file indicating that it needs to be inspected, like so:
private static final int DATE_FIELD = 5;

File input = new File("input.txt");
Pattern p = Pattern.compile("\\d{2}/\\d{2}/\\d{2}");
BufferedReader reader = new BufferedReader(new FileReader(input));
String line = reader.readLine();
while (line != null) {
    String[] splitLine = line.split(",");
    Matcher m = p.matcher(splitLine[DATE_FIELD]);
    if (!m.matches()) {
        // write warning to separate file
    }
    line = reader.readLine();
}
reader.close();
This code is compiled as part of a larger JAR file which is installed on 4 computers in the office (mine and three others). The jar file is invoked via a shell call made by a separate program, passing in the relevant parameters. This is part of a QC check before we import the data into our database, and the date is a required field, so if the date field was left blank, it should be flagged for review.
The regex that I used should not allow a blank date to pass, and when I run it on my machine, it properly flags the missing dates. However, on my coworkers' machines, the blank dates were somehow not flagged, as if the field wasn't checked at all, which caused a little grief when the file was being imported into the database.
In other words, there is some discrepancy between our machines that caused the code to execute incorrectly on their machines, but not mine. All of the machines have Java 8 (not sure exactly which version, but they should all be the same version). How can that be?

You need to specify the encoding of the file(s) you want to read.
[The constructors of FileReader] generally use the platform default encoding. So determine the actual encoding and use something like new InputStreamReader(new FileInputStream(input), <encoding>)
Check the Java version on each machine, and verify that the designated java binary is actually the one being invoked.
Check the encoding of the input file(s) themselves (UTF-8, CP1252, or another).
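The advice above can be sketched as follows. This is a minimal, self-contained version of the asker's loop, with the input file replaced by an in-memory byte array so the charset choice is visible at the call site; the field index and regex are taken from the question. Note also that split(",") drops trailing empty fields, so a blank date in the last column disappears unless a negative limit is passed:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class EncodingSafeRead {
    public static void main(String[] args) throws IOException {
        Pattern p = Pattern.compile("\\d{2}/\\d{2}/\\d{2}");
        // Stand-in for the real input file: two CSV lines, one with a blank date field.
        byte[] data = "a,b,c,d,e,01/02/03\na,b,c,d,e,\n".getBytes(StandardCharsets.UTF_8);
        // Name the charset explicitly instead of relying on the platform default.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // The -1 limit keeps trailing empty fields; split(",") would drop them.
                String[] fields = line.split(",", -1);
                String date = fields.length > 5 ? fields[5] : "";
                if (!p.matcher(date).matches()) {
                    System.out.println("flagged: " + line);  // flagged: a,b,c,d,e,
                }
            }
        }
    }
}
```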

Related

Can't use latin characters correctly when creating new files from within Java. Filenames get weird characters instead of the correct ones

I'm currently saving an int[] from a HashMap in a file whose name is the key mapped to that int[]. This exact key must be reachable from another program, so I can't restrict the file names to English-only characters. But even though I use ISO_8859_1 as the charset for the filenames, the files get all messed up in the file tree: the English letters are correct, but the special ones are not.
/**
 * Save array to file
 */
public void saveStatus() {
    try {
        for (String currentKey : hmap.keySet()) {
            byte[] currentKeyByteArray = currentKey.getBytes();
            String bytesString = new String(currentKeyByteArray, StandardCharsets.ISO_8859_1);
            String fileLocation = "/var/tmp/" + bytesString + ".dat";
            FileOutputStream saveFile = new FileOutputStream(fileLocation);
            ObjectOutputStream out = new ObjectOutputStream(saveFile);
            out.writeObject(hmap.get(currentKey));
            out.close();
            saveFile.close();
            System.out.println("Saved file at " + fileLocation);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Could it have to do with how linux is encoding characters or is more likely to do with the Java code?
EDIT
I think the problem lies with the OS, because when looking at text files with cat, for example, the problem is the same. However, vim is able to decode the letters correctly. In that case, would I perhaps have to change the language settings of the terminal?
You have to change the charset in the getBytes call as well:
currentKey.getBytes(StandardCharsets.ISO_8859_1);
Also, why are you using StandardCharsets.ISO_8859_1? To accept a wider range of characters, use StandardCharsets.UTF_8.
The valid characters of a filename or path vary depending on the file system used. While it should be possible to just use a java string as filename (as long as it does not contain characters invalid in the given file system), there might be interoperability issues and bugs.
In other words, leave out all the Charset magic, as @RealSkeptic recommends, and it should work. But changing the environment might result in unexpected behavior.
Depending on your requirements, you might therefore want to encode the key to make sure it only uses a reduced character set. One variant of Base64 might work (assuming your file system is case sensitive!). You might even find a library (Apache Commons?) offering a function to reduce a string to characters safe for use in a file name.
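One way to follow the Base64 suggestion above, sketched with the JDK's built-in java.util.Base64 (the key and directory here are made up for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class SafeFileNames {
    // Encode an arbitrary key into a name that is safe on common file systems.
    // The URL-safe alphabet avoids '/' and '+'; note that Base64 is case
    // sensitive, so this assumes a case-sensitive file system.
    static String toFileName(String key) {
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(key.getBytes(StandardCharsets.UTF_8));
    }

    static String fromFileName(String fileName) {
        return new String(Base64.getUrlDecoder().decode(fileName), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "nyckel-åäö";  // hypothetical key with non-ASCII characters
        String name = toFileName(key);
        System.out.println("/var/tmp/" + name + ".dat");
        // Round-trips exactly, so the original key is recoverable from the file name.
        System.out.println(fromFileName(name));
    }
}
```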

How do you reference a file in a path that has a space in it in Java?

I'm working on a project that involves reading a txt file, and the way I currently have it set up is with...
reader = new BufferedReader(new FileReader(new File(url)));
...where url is a String. I don't have it set up for the user to input their own file path (or my ultimate goal to be able to choose it in a window, but that's a different matter), so I Just have url set to something like...
"file:///C:/Users/Jeremiah/Desktop/generic_text_file.txt"
My problem is that, with this technique, I can't include spaces in the file path or I'll get an invalid character exception, yet most of the files and directories a person actually deals with have spaces in them, even ones that come with the computer, like "My Documents".
I've also tried passing the String through a method to escape the spaces by adding "\" in front of them, but that still isn't working.
public String escapeSpaces(String string) {
    int cursor = 0;
    System.out.println(string);
    while (cursor < string.length()) {
        if (string.charAt(cursor) == ' ') {
            string = string.substring(0, cursor) + "\\" + string.substring(cursor, string.length());
            System.out.println(string);
            cursor++;
        }
        cursor++;
    }
    return string;
}
So how would one get around this issue so that I could instead reference a file in say...
"file:///C:/Users/Jeremiah/Desktop/S O M A N Y S P A C E S/generic_text_file.txt"
Any feedback is appreciated.
You can't construct a File from a URL string. Just pass a proper filename string directly to the constructor of File, or indeed to the constructor of FileReader. There is no issue with spaces in filenames.
it still doesn't allow me to use a file path with spaces
Yes it does. You are mistaken.
escaped or not
Filenames do not require escaping. URLS require escaping. But you're just making an unnecessary mess by using the URL class.
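To illustrate: a plain file-system path with spaces works directly with File, and if you really do start from a file: URL, you can go through a URI rather than escaping by hand. A small sketch (the directory and file names are temporary stand-ins for the asker's paths):

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;

public class SpacesInPaths {
    public static void main(String[] args) throws IOException {
        // A plain file-system path with spaces works as-is; no escaping needed.
        Path dir = Files.createTempDirectory("S O M A N Y S P A C E S");
        File f = new File(dir.toFile(), "generic text file.txt");
        try (PrintWriter out = new PrintWriter(f)) {
            out.println("hello");
        }
        System.out.println(f.exists());  // true

        // Only URLs need percent-escaping; File can round-trip through a URI,
        // which handles the escaping for you.
        File viaUri = new File(f.toURI());
        System.out.println(viaUri.equals(f));  // true
    }
}
```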

How do I replace illegal characters in a filename?

I am trying to create a zip with folders inside it and I have to sanitize the folder names against any illegal characters. I did some googling around and found this method from http://www.rgagnon.com/javadetails/java-0662.html:
public static String sanitizeFilename(String name) {
return name.replaceAll("[\\\\/:*?\"<>|]", "-");
}
However, upon testing I get some weird results. For example:
name = filename£/?e>"e
should return filename£--e--e from my understanding. But instead it returns filename-ú--e--e
Why is this so?
Please note that I am testing this by opening the downloaded zip file in WinZip and looking at the folder name that is created. I can't get the pound sign to appear. I've also tried this:
public static String sanitizeFilename(String name) {
name = name.replaceAll("[£]", "\u00A3");
return name.replaceAll("[\\\\/:*?\"<>|]", "-");
}
EDIT: Some more research and I found this: http://illegalargumentexception.blogspot.co.uk/2009/04/i18n-unicode-at-windows-command-prompt.html
It appears to do with Locale, windows versions and encoding factors. Not sure how I can overcome this within the code.
I think it depends on how you are actually reading the file name in terms of encoding.
Therefore, the £ symbol might get corrupted.
As an example not fitting your case exactly, reading a UTF-8-encoded £ as ISO Latin 1-encoded bytes would return Â£.
Make sure of the file's encoding (i.e. ISO Latin 1 vs UTF-8 would be the most common), then use the appropriate parameter for your Reader.
As a snippet, you may want to consider this example:
BufferedReader br = new BufferedReader(
    new InputStreamReader(
        new FileInputStream(new File("yourTextFile")),
        "[your file's encoding]"
    )
);

Get list of processes on Windows in a charset-safe way

This post gives a solution to retrieve the list of running processes under Windows. In essence it does:
String cmd = System.getenv("windir") + "\\system32\\" + "tasklist.exe";
Process p = Runtime.getRuntime().exec(cmd);
InputStreamReader isr = new InputStreamReader(p.getInputStream());
BufferedReader input = new BufferedReader(isr);
then reads the input.
It looks and works great but I was wondering if there is a possibility that the charset used by tasklist might not be the default charset and that this call could fail?
For example this other question about a different executable shows that it could cause some issues.
If that is the case, is there a way to determine what the appropriate charset would be?
We can break this into two parts:
The Windows part
From Java you're executing a Windows command, externally to the JVM, in "Windows land". When the Java Runtime class executes a Windows command, it uses the console DLL, so it appears to Windows as if the command is running in a console.
Q: When I run C:\windows\system32\tasklist.exe in a console, what is the character encoding ("code page" in windows terminology) of the result?
The Windows chcp command with no argument gives the active code page number for the console (e.g. 850 for Multilingual Latin-1, 1252 for Latin-1). See Windows Microsoft Code Pages, Windows OEM Code Pages, Windows ISO Code Pages.
The default system code page is originally set up according to your system locale (type systeminfo to see it, or see Control Panel -> Region and Language).
The Windows OS/.NET function GetACP() also gives this information.
The java part:
How do I decode a java byte stream from the windows code page of "x" (e.g. 850 or 1252)?
the full mapping between windows code page numbers and equivalent java charset names can be derived from here - Code Page Identifiers (Windows)
However, in practice one of the following prefixes can be added to achieve the mapping:
"" (none) for ISO, "IBM" or "x-IBM" for OEM, "windows-" OR "x-windows-" for Microsoft/Windows.
E.g. ISO-8859-1 or IBM850 or windows-1252
Full Solution:
String cmd = System.getenv("windir") + "\\system32\\" + "chcp.com";
Process p = Runtime.getRuntime().exec(cmd);
// Use default charset here - only want digits, which are "core UTF8/UTF16";
// ignore text preceding ":"
String windowsCodePage = new Scanner(
        new InputStreamReader(p.getInputStream())).skip(".*:").next();
Charset charset = null;
String[] charsetPrefixes =
        new String[] {"", "windows-", "x-windows-", "IBM", "x-IBM"};
for (String charsetPrefix : charsetPrefixes) {
    try {
        charset = Charset.forName(charsetPrefix + windowsCodePage);
        break;
    } catch (Throwable t) {
    }
}
// If no match found, use default charset
if (charset == null) charset = Charset.defaultCharset();

cmd = System.getenv("windir") + "\\system32\\" + "tasklist.exe";
p = Runtime.getRuntime().exec(cmd);
InputStreamReader isr = new InputStreamReader(p.getInputStream(), charset);
BufferedReader input = new BufferedReader(isr);

// Debugging output
System.out.println("matched codepage " + windowsCodePage + " to charset name:" +
        charset.name() + " displayName:" + charset.displayName());

String line;
while ((line = input.readLine()) != null) {
    System.out.println(line);
}
Thanks for the Q! - was fun.
Actually, the charset used by tasklist is always different from the system default.
On the other hand, it's quite safe to use the default as long as the output is limited to ASCII. Usually executable modules have only ASCII characters in their names.
So to get the correct Strings, you have to convert (ANSI) Windows code page to OEM code page, and pass the latter as charset to InputStreamReader.
It seems there's no comprehensive mapping between these encodings, but the following partial mapping can be used:
Map<String, String> ansi2oem = new HashMap<String, String>();
ansi2oem.put("windows-1250", "IBM852");
ansi2oem.put("windows-1251", "IBM866");
ansi2oem.put("windows-1252", "IBM850");
ansi2oem.put("windows-1253", "IBM869");

Charset charset = Charset.defaultCharset();
String streamCharset = ansi2oem.get(charset.name());
if (streamCharset == null) {
    streamCharset = charset.name();
}
InputStreamReader isr = new InputStreamReader(p.getInputStream(),
        streamCharset);
This approach worked for me with windows-1251 and IBM866 pair.
To get the current OEM encoding used by Windows, you can use GetOEMCP function. The return value depends on Language for non-Unicode programs setting on Administrative tab in Region and Language control panel. Reboot is required to apply the change.
There are two kinds of encodings on Windows: ANSI and OEM.
The former is used by non-Unicode applications running in GUI mode.
The latter is used by Console applications. Console applications cannot display characters that cannot be represented in the current OEM encoding.
Since tasklist is console mode application, its output is always in the current OEM encoding.
For English systems, the pair is usually Windows-1252 and CP850.
As I am in Russia, my system has the following encodings: Windows-1251 and CP866.
If I capture output of tasklist into a file, the file can't display Cyrillic characters correctly:
I get ЏаЁўҐв instead of Привет (Hi!) when viewed in Notepad.
And µTorrent is displayed as зTorrent.
You cannot change the encoding used by tasklist.
However it's possible to change the output encoding of cmd. If you pass /u switch to it, it will output everything in UTF-16 encoding.
cmd /c echo Hi>echo.txt
The size of echo.txt is 4 bytes: two bytes for Hi and two bytes for new line (\r and \n).
cmd /u /c echo Hi>echo.txt
Now the size of echo.txt is 8 bytes: each character is represented with two bytes.
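The effect of /u can be demonstrated without a Windows box: this sketch decodes the same UTF-16LE bytes that cmd /u would produce. On Windows you would pass the process's input stream instead of the byte array (the commented-out exec call shows where):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf16CmdOutput {
    public static void main(String[] args) throws IOException {
        // cmd /u writes UTF-16LE: "Hi\r\n" becomes 8 bytes instead of 4.
        byte[] fromCmdU = "Hi\r\n".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(fromCmdU.length);  // 8

        // On Windows, replace the byte array with the real process stream:
        // Process p = Runtime.getRuntime().exec(new String[]{"cmd", "/u", "/c", "echo", "Hi"});
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(fromCmdU),
                        StandardCharsets.UTF_16LE))) {
            System.out.println(r.readLine());  // Hi
        }
    }
}
```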
Why not use the Windows API via JNA, instead of spawning processes? Like this:
import com.sun.jna.Native;
import com.sun.jna.platform.win32.Kernel32;
import com.sun.jna.platform.win32.Tlhelp32;
import com.sun.jna.platform.win32.WinDef;
import com.sun.jna.platform.win32.WinNT;
import com.sun.jna.win32.W32APIOptions;

public class ListProcesses {
    public static void main(String[] args) {
        Kernel32 kernel32 = (Kernel32) Native.loadLibrary(Kernel32.class, W32APIOptions.UNICODE_OPTIONS);
        Tlhelp32.PROCESSENTRY32.ByReference processEntry = new Tlhelp32.PROCESSENTRY32.ByReference();
        WinNT.HANDLE snapshot = kernel32.CreateToolhelp32Snapshot(Tlhelp32.TH32CS_SNAPPROCESS, new WinDef.DWORD(0));
        try {
            while (kernel32.Process32Next(snapshot, processEntry)) {
                System.out.println(processEntry.th32ProcessID + "\t" + Native.toString(processEntry.szExeFile));
            }
        } finally {
            kernel32.CloseHandle(snapshot);
        }
    }
}
I posted a similar answer elsewhere.
There is a much better way to check the running processes, or even to run OS commands from Java: Process and ProcessBuilder.
As for the charset, you can always query the OS for its supported charsets, and obtain an encoder or decoder according to your needs.
[Edit]
Let's break it down: there's no way of knowing which encoding the bytes of a given String are in, so your only choice is to get those bytes, shift the byte ordering as necessary (if you're ever in an environment where a process can hand you an array of bytes in a different ordering, use ByteBuffer to deal with that), and use the multiple supported CharsetDecoders to decode the bytes to reasonable output.
It is overkill and requires you to estimate that a given output could be in UTF-8, UTF-16 or any other encoding. But at least you can decode the given output using one of the possible Charsets, and then try to use the processed output for your needs.
Since we're talking about a process run by the same OS in which the JVM itself is running, it is quite possible that your output will be in one of the Charset encodings returned by the availableCharsets() method.
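The trial-decoding idea above can be sketched with strict CharsetDecoders. Note this is only a heuristic: the order of candidates matters, because a charset like ISO-8859-1 accepts every byte sequence and so always "succeeds":

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class GuessDecode {
    // Try candidate charsets in order; a strict decoder (REPORT instead of the
    // default REPLACE) rejects byte sequences that are invalid for the charset.
    static String decodeWithCandidates(byte[] bytes, Charset... candidates) {
        for (Charset cs : candidates) {
            try {
                return cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes))
                        .toString();
            } catch (CharacterCodingException e) {
                // Not valid in this charset; try the next candidate.
            }
        }
        throw new IllegalArgumentException("no candidate charset matched");
    }

    public static void main(String[] args) {
        byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8);
        // Put the stricter charset first; ISO-8859-1 would accept anything.
        System.out.println(decodeWithCandidates(utf8,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));  // héllo
    }
}
```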

UTF-8 character encoding in Java

I am having some problems getting some French text to convert to UTF8 so that it can be displayed properly, either in a console, text file or in a GUI element.
The original string is
HANDICAP╔ES
which is supposed to be
HANDICAPÉES
Here is a code snippet that shows how I am using the jackcess database driver to read in the Access MDB file in an Eclipse/Linux environment.
Database database = Database.open(new File(filepath));
Table table = database.getTable(tableName, true);
Iterator<Map<String, Object>> rowIter = table.iterator();
while (rowIter.hasNext()) {
    Map<String, Object> row = rowIter.next();
    // convert fields to UTF
    Map<String, Object> rowUTF = new HashMap<String, Object>();
    try {
        for (String key : row.keySet()) {
            Object o = row.get(key);
            if (o != null) {
                String valueCP850 = o.toString();
                // String nameUTF8 = new String(valueCP850.getBytes("CP850"), "UTF8"); // does not work!
                String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
                String valueUTF8 = new String(valueISO.getBytes(), "UTF-8"); // works!
                rowUTF.put(key, valueUTF8);
            }
        }
    } catch (UnsupportedEncodingException e) {
        System.err.println("Encoding exception: " + e);
    }
}
In the code you'll see where I want to convert directly to UTF8, which doesn't seem to work, so I have to do a double conversion. Also note that there doesn't seem to be a way to specify the encoding type when using the jackcess driver.
Thanks,
Cam
New analysis, based on new information.
It looks like your problem is with the encoding of the text before it was stored in the Access DB. It seems it had been encoded as ISO-8859-1 or windows-1252, but decoded as cp850, resulting in the string HANDICAP╔ES being stored in the DB.
Having correctly retrieved that string from the DB, you're now trying to reverse the original encoding error and recover the string as it should have been stored: HANDICAPÉES. And you're accomplishing that with this line:
String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
getBytes("CP850") converts the character ╔ to the byte value 0xC9, and the String constructor decodes that according to ISO-8859-1, resulting in the character É. The next line:
String valueUTF8 = new String(valueISO.getBytes(), "UTF-8");
...does nothing. getBytes() encodes the string in the platform default encoding, which is UTF-8 on your Linux system. Then the String constructor decodes it with the same encoding. Delete that line and you should still get the same result.
More to the point, your attempt to create a "UTF-8 string" was misguided. You don't need to concern yourself with the encoding of Java's strings--they're always UTF-16. When bringing text into a Java app, you just need to make sure you decode it with the correct encoding.
And if my analysis is correct, your Access driver is decoding it correctly; the problem is at the other end, possibly before the DB even comes into the picture. That's what you need to fix, because that new String(getBytes()) hack can't be counted on to work in all cases.
Original analysis, based on no information. :-/
If you're seeing HANDICAP╔ES on the console, there's probably no problem. Given this code:
System.out.println("HANDICAPÉES");
The JVM converts the (Unicode) string to the platform default encoding, windows-1252, before sending it to the console. Then the console decodes that using its own default encoding, which happens to be cp850. So the console displays it wrong, but that's normal. If you want it to display correctly, you can change the console's encoding with this command:
CHCP 1252
To display the string in a GUI element, such as a JLabel, you don't have to do anything special. Just make sure you use a font that can display all the characters, but that shouldn't be problem for French.
As for writing to a file, just specify the desired encoding when you create the Writer:
OutputStreamWriter osw = new OutputStreamWriter(
new FileOutputStream("myFile.txt"), "UTF-8");
String s = "HANDICAP╔ES";
System.out.println(new String(s.getBytes("CP850"), "ISO-8859-1")); // HANDICAPÉES
This shows the correct string value, which means the text was originally encoded/decoded with ISO-8859-1 and then incorrectly encoded with CP850. (CP1252, a.k.a. Windows ANSI, as pointed out in a comment, is also possible, since É has the same code point there as in ISO-8859-1.)
Align your environment and binary pipelines so that they all use one and the same character encoding. You can't and shouldn't convert between them; you would risk losing information in the non-ASCII range that way.
Note: do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
Update: you are apparently still struggling with the problem. I'll repeat the important parts of the answer:
Align your environment and binary pipelines so that they all use one and the same character encoding.
You can not and should not convert between them. You would risk losing information in the non-ASCII range that way.
Do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
To fix the problem you need to choose character encoding X which you'd like to use throughout the entire application. I suggest UTF-8. Update MS Access to use encoding X. Update your development environment to use encoding X. Update the java.io readers and writers in your code to use encoding X. Update your editor to read/write files with encoding X. Update the application's user interface to use encoding X. Do not use Y or Z or whatever at some step. If the characters are already corrupted in some datastore (MS Access, files, etc), then you need to fix it by manually replacing the characters right there in the datastore. Do not use Java for this.
If you're actually using the "command prompt" as user interface, then you're actually lost. It doesn't support UTF-8. As suggested in the comments and in the article linked in the comments, you need to create a Swing application instead of relying on the restricted command prompt environment.
You can specify the encoding when establishing the connection. This worked perfectly and solved my encoding problem:
DatabaseImpl open = DatabaseImpl.open(new File("main.mdb"), true, null, Database.DEFAULT_AUTO_SYNC, java.nio.charset.Charset.availableCharsets().get("windows-1251"), null, null);
Table table = open.getTable("FolderInfo");
Using "ISO-8859-1" helped me deal with the French characters.
