How to find if the file is a CSV file? - java

I have a scenario wherein the user uploads a file to the system. The only file type that the system understands is CSV, but the user can upload any type of file, e.g. JPEG, DOC, HTML. I need to throw an exception if the user uploads anything other than a CSV file.
Can anybody tell me how to find out whether the uploaded file is a CSV file or not?

CSV files vary a lot, and they can all legitimately be called CSV files.
I don't think your approach is the best one. Rather than asking whether the upload is a CSV file, ask whether it is a text file the application can parse.
You would report an error whenever you can't parse the file, be it a JPG, an MP3, or a CSV in a format you cannot parse.
To do that, I would look for a library that parses various CSV formats; otherwise you have a long road ahead writing code to parse the many possible variants (or you restrict the application's flexibility by supporting only a few CSV formats).
One such library for Java is opencsv.

If you're using a CSV parser from a library, all you have to do is catch any errors it throws.
If the CSV parser you're using is remotely robust, it will throw some useful errors in the event that it doesn't understand the file format.

I can think of several methods.
One way is to try to decode the file using UTF-8. (This is built into Java and is probably built into .NET too.) If the file decodes properly, then you at least know that it's a text file of some kind.
Once you know it's a text file, parse out the individual fields from each line and check that you get the number of fields that you expect. If the number of fields per line is inconsistent then you might just have a file that contains text but is not organized into lines and fields.
Otherwise you have a CSV. Then you can validate the fields.
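The two steps above (strict UTF-8 decoding, then a per-line field count) can be sketched as follows. This is a minimal illustration, not a full CSV parser: the class name CsvSniffer, its method names, and the expectedFields parameter are made up for this example, and the naive split on commas does not understand quoted fields.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named for this example only.
final class CsvSniffer {

    /** Returns the decoded text, or null if the bytes are not valid UTF-8. */
    static String decodeStrictUtf8(byte[] data) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of silently
                    // substituting replacement characters.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** True if the data is UTF-8 text and every non-empty line has expectedFields fields. */
    static boolean looksLikeCsv(byte[] data, int expectedFields) {
        String text = decodeStrictUtf8(data);
        if (text == null) {
            return false; // binary content such as a JPEG usually fails here
        }
        boolean sawLine = false;
        for (String line : text.split("\\r?\\n")) {
            if (line.isEmpty()) {
                continue;
            }
            sawLine = true;
            // Naive split: assumes no quoted commas. The -1 keeps trailing empty fields.
            if (line.split(",", -1).length != expectedFields) {
                return false;
            }
        }
        return sawLine;
    }
}
```

For an upload, the byte array could come from Files.readAllBytes on the stored file; a false result would then be turned into the exception the question asks for.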

If it's a web application, you might want to check the content-type HTTP header the browser sends when uploading/posting a file through a form.
If there's a binding for the language you're using, you might also try libmagic, which is pretty good at recognizing file types. For example, the UNIX tool file uses it.
http://sourceforge.net/projects/libmagic/

I don't know if you can tell for 100% certain in any way, but I'd suggest that the first validations should be:
Is the file extension .csv?
Count the number of commas per line; there should normally be the same number of commas on each line for the file to be a valid CSV file. (As Jkramer said, this only works if the files can't contain quoted commas.)
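If quoted commas are a concern, the per-line count can skip commas that sit inside double quotes. The following is a rough sketch under that assumption; QuotedFieldCounter and countFields are hypothetical names, and the code does not handle quoted fields that span several lines.

```java
// Hypothetical helper, named for this example only.
final class QuotedFieldCounter {

    /** Counts comma-separated fields in one line, ignoring commas inside double quotes. */
    static int countFields(String line) {
        int fields = 1;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state is unchanged.
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields++;
            }
        }
        return fields;
    }
}
```

Comparing countFields for every line against the count of the first line gives the same consistency check as counting raw commas, without being thrown off by values such as "Smith, John".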

Try this one:
String type = Files.probeContentType(Paths.get(filepath));

I solved it like this: read the file with UTF-16 encoding. If no comma is found in the result, the UTF-16 decoding didn't work, so the file is read again as plain text. Otherwise the CSV file is in Excel's UTF-16 format (NOT plain text).
if (fileA.endsWith(".csv") && fileB.endsWith(".csv")) {
    // readCSVFile reads the file with UTF-16 encoding
    second_list = readCSVFile(fileA);
    new_list = readCSVFile(fileB);
    if (!String.join("", second_list).contains(",") || !String.join("", new_list).contains(",")) {
        // read these files with UTF-8 encoding
        System.out.println("[WARN] csv files will be read like text files. (UTF-16 encoding couldn't find any comma in the file, i.e., UTF-16 encoding didn't work)");
        second_list = readFile(fileA);
        new_list = readFile(fileB);
    } else {
        // keep the csv as UTF-16 encoded
    }
}

Related

BufferedOutputStream not working with Korean characters as expected

I'm trying to write Korean characters to a file, but what shows up when I open the file as CSV is gibberish. How can I get the Korean text to display correctly without the workaround of decoding it back to UTF-8?
File localExport = File.createTempFile("char-test", ".csv");
try (
    FileOutputStream fos = new FileOutputStream(localExport);
    BufferedOutputStream bos = new BufferedOutputStream(fos);
    OutputStreamWriter outputStreamWriter =
        new OutputStreamWriter(bos, StandardCharsets.UTF_8)
) {
    ArrayList<String> rows = new ArrayList<>();
    rows.add("\"가짜 사용자\",사용자123,saint1_user123");
    rows.add("\"페이크유저루노도스트레스 성도1\",saint1_user1");
    for (int i = 0; i < 2; i++) {
        String csvUserStr = rows.get(i);
        outputStreamWriter.write(csvUserStr);
    }
}
It's writing the below data instead of the one I'm actually writing to the File.
There is absolutely nothing wrong with your Java code. You are writing those characters, including the Korean ones, precisely as written.
Whatever tool you are using to look at this file is the broken one. Tell it that the file is UTF-8. If you can't, get a better tool, or figure out which encoding it reads in and update your Java code to match.
Note that CSV files, text files, etc - they do not store the encoding that was used to write the data. All the programs that read/write to the file need to just know what encoding it is, there's no real way to know other than being told.
UPDATE: From a comment, it looks like 'the tool that is reading this' is Excel.
Excel asks for the encoding of the file when you use the 'import CSV' dialog. Pick UTF-8 in the dropdown. Depending on which version/OS you're on, it's usually called 'File Origin'.
If you prefer that your client not have to change the default: the default is usually something like MacRoman or Windows-1252, and with such an encoding it is in fact impossible to get Korean characters. They simply aren't in that set.
If you want the fire-and-forget approach, generate the Excel file yourself, for example using Apache POI.
CSV files don't have any means to carry encoding information "in-band"—in the file itself. I'm guessing the default character encoding used for Excel CSV imports is the system default, so if that isn't Korean, they will have to specify the encoding when they import the CSV. If your client requires CSV, they have no choice but to accept that behavior.
However, if their requirement is to open your file in Excel (and not that the file has to be CSV format), you could write an Excel spreadsheet instead. The various Excel file formats do include character encoding information, so they would be able to open the file without manually specifying the encoding.
Library recommendations are off-topic, but libraries such as Apache POI make writing simple Excel sheets fairly easy. There are additional benefits as well, such as taking care of any necessary escaping for you, so that your file doesn't repeatedly break when unanticipated values are included in the spreadsheet.
As mentioned, Excel fails to detect that the text is encoded in UTF-8. One solution is to write an invisible BOM character as the first character:
outputStreamWriter.write("\uFEFF");
for...
This marker is normally superfluous (and ugly) in UTF-8; here it serves only as a hint about the encoding.
By the way take a look at the class Files, that can reduce the code to one line.
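Combining the BOM with the Files class might look like the sketch below. BomCsvWriter and its methods are hypothetical names for this example. Unlike the loop in the question, it also puts a line terminator after each row, which a CSV reader will most likely expect.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, named for this example only.
final class BomCsvWriter {

    /** Encodes the rows as UTF-8, prefixed with a BOM and with CRLF after each row. */
    static byte[] encodeWithBom(List<String> rows) {
        // U+FEFF becomes the bytes EF BB BF when encoded as UTF-8.
        String text = "\uFEFF" + String.join("\r\n", rows) + "\r\n";
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Writes the encoded rows to the target file in a single call. */
    static void write(Path target, List<String> rows) throws IOException {
        Files.write(target, encodeWithBom(rows));
    }
}
```

With the rows list from the question, BomCsvWriter.write(localExport.toPath(), rows) would replace the whole try-with-resources block.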

converting .prn file in to html page using java

How can I convert a .prn file into an HTML page using Java?
I am treating it as a text file and reading it line by line, but that's quite cumbersome as each line requires its own splitting logic. As the .prn file is nicely formatted, can we directly extract the contents and load them as HTML? Any suggestions?
Since a .prn file is a byte stream that is sent to a printer for printing, I think you are going to have to keep using your custom parser, as the Java Print Service doesn't appear to have any options for parsing.
If the tags are consistent with other file formats, it may be worthwhile to check out other parsing libraries such as simple.json and modify them to your needs.

Reading the content of file section wise in java

I want to read the content of files such as doc, pdf, ppt etc. section by section or paragraph by paragraph in Java, because I want to retrieve a particular section of a file (if it has one) instead of retrieving the content of the whole file. Can anyone tell me how I can read the content of any file section-wise or paragraph-wise?
Thanks
This depends entirely on the format of the file in question. For example, when you have a .docx file, you can employ some XML parser and then iterate through the result or use XPath to find all paragraphs, sections or whatever you wish to extract.
For other file formats you will have to find a different approach. There is no single way to extract a specific part of any file, as different file types have different ways of storing data. Most likely, you will have to collect a bunch of libraries, one for each file type.

Why I am getting mimetype text/plain for CSV file using JMimeMagic lib?

I am using JMimeMagic lib to validate CSV file upload.
For CSV and every other text file (txt, JSP etc) it gives me text/plain mime type.
logger.debug("Checking magic content");
MagicMatch match;
match = Magic.getMagicMatch(getPromotionOptIn().getUpload(),false);
logger.debug("Actual file mimetype=" + match.getMimeType());
Shouldn't I get text/csv for a CSV file? (See the full list of MIME types.)
Or is it fine to base my validation on text/plain and treat that as a valid CSV file?
Since CSV files can have multiple different separators I suspect the csv file is just recognized as a text file (which is true).
If you see a text file, how do you know for sure it is a CSV file? Because there are commas, semicolons etc. in the text? What if those belong to an entry and the separator is something else (like |, #, etc.)?
You'll have difficulties with telling for sure without more information and JMimeMagic will have the same problems. Thus it will only return what it is sure about: the file is a text file. Thus you "only" get "text/plain".
I don't know that library, but from the documentation/source it seems like you could give a hint that *.csv files have the text/csv mime type using Magic.addHint("csv", someMatcher). Note that you might have to pass true for the second parameter, as otherwise those hints might be ignored (it seems so from looking at the sources).
That would still depend on the file extension to be correct, i.e. if someone uploaded a .csv file that contains something else you'll get the wrong mime type.
However, it seems like JMimeMagic would not do much content checking anyways. At least I didn't find much in the sources I found at sourceforge/github. There's only a text file detector so you might have to add your own content detectors for other mime types and file formats.
My guess is that JMimeMagic uses the first few bytes of the file to determine the type. This is possible for many different file types as they have very standard headers. Some text files, like HTML, will have the text <html somewhere near the beginning, thus giving you a good guess as to what type of file it is.
This sort of deduction is not possible for CSV files. They do not have standard headers. It is difficult to programmatically tell a CSV file from a shopping list with commas in it. It does give you a correct answer of text/plain, as all CSV files are.
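To illustrate the "first few bytes" idea, here is a small sketch that recognizes a handful of well-known signatures. It is not how JMimeMagic is implemented; MagicSniffer and sniff are hypothetical names, and the list of signatures is deliberately tiny. CSV content matches none of them, so the sketch returns null for it.

```java
// Hypothetical helper, named for this example only.
final class MagicSniffer {

    /** Returns a MIME type guessed from the leading bytes, or null if none matches. */
    static String sniff(byte[] head) {
        if (startsWith(head, 0xFF, 0xD8, 0xFF)) {
            return "image/jpeg";
        }
        if (startsWith(head, 0x89, 'P', 'N', 'G')) {
            return "image/png";
        }
        if (startsWith(head, '%', 'P', 'D', 'F')) {
            return "application/pdf";
        }
        if (startsWith(head, 'P', 'K', 0x03, 0x04)) {
            return "application/zip"; // also the container for .docx and .xlsx
        }
        return null; // no signature: could be plain text, CSV, or anything else
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A check like this can reject obvious non-text uploads early, but a null result only says "unknown"; whether the file is usable as CSV still has to be decided by trying to parse it.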

Excel or text file, which one to use?

I need to recommend an input format: Excel file or text file.
Assume the input is a large number of lines where I need to read the first string, for example:
A,B,C,D....
I need to read the first string (in this case A) to identify the matching row. Should I use an Excel file and use POI to read the first cell of each row, or a text file where the tokens on each line are separated by a delimiter, parsing each line to read the first token?
Use a text file. Because computers like it more. If business requires it, rename that text file into a "csv" file and you've got an Excel file.
If humans are going to enter data, then use Excel. If the file is used as a communication channel between two systems, use as simple a file as possible.
If at all possible, use text file - much easier to handle/troubleshoot, easier to generate, uses less memory, does not have restrictions on number of rows, etc. In general - more predictable.
If you go with text files, you have people manually preparing those files, and you are dealing with non-ASCII text, you had better make sure everybody sends you the files in the correct encoding (usually UTF-8 would be best). This is not an issue with Excel.
The only reason to use Excel workbook would be when you need some "business-people" to produce those input files, then that input effectively becomes a user interface to your system - Excel is usually considered more user friendly than Notepad. ;-)
If you do go with Excel, make sure that the people producing those Excel files will give you the correct version (I assume you would want the "old" XLS format, not the new XLSX format).
Rule of thumb: use a text file. It's more interchangeable and way easier to handle by any other software you may need to support in a few years.
If you need humans to edit the data and you need the nicely formatted, colored display that Excel can provide, consider creating a macro that stores the data as CSV.
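For the text-file option, reading the first token of each line takes only a few lines of code. The sketch below assumes a single-character delimiter and no quoting; FirstTokenLookup and its methods are hypothetical names for this example.

```java
import java.util.Optional;

// Hypothetical helper, named for this example only.
final class FirstTokenLookup {

    /** Returns the text before the first delimiter, or the whole line if there is none. */
    static String firstToken(String line, char delimiter) {
        int idx = line.indexOf(delimiter);
        return idx < 0 ? line : line.substring(0, idx);
    }

    /** Returns the first line whose first token equals key, if any. */
    static Optional<String> findRow(Iterable<String> lines, String key, char delimiter) {
        for (String line : lines) {
            if (firstToken(line, delimiter).equals(key)) {
                return Optional.of(line);
            }
        }
        return Optional.empty();
    }
}
```

For a large file, the lines could be streamed rather than loaded at once, for example with Files.lines or a BufferedReader, since only the first token of each line is inspected.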
