I am trying to read a dataset file with a CSV extension. It has just two columns, 1- URL and 2- Label (malicious or benign), and 2 rows; it's a test sample from a bigger dataset. How can I do that?
csv file content:
https://www.google.com
0
http://atualizacaodedados.online
1
first, I import URI
import java.net.URI;
and then I try this code:
Scanner dataset = new Scanner(new FileReader("urltest.csv"));
dataset.useDelimiter(",");
URI uri[]= new URI[2];
while(dataset.hasNext()) {
for (int i = 0; i < 2; i++) {
uri[i] = new URI(dataset.next());
System.out.println(uri[i]);
}
}
but it gives me this error:
Exception in thread "main" java.net.URISyntaxException: Illegal character in scheme name at index 0: https://www.google.com
at java.base/java.net.URI$Parser.fail(URI.java:2966)
at java.base/java.net.URI$Parser.checkChars(URI.java:3137)
at java.base/java.net.URI$Parser.checkChar(URI.java:3147)
at java.base/java.net.URI$Parser.parse(URI.java:3162)
at java.base/java.net.URI.<init>(URI.java:623)
at feturesExtraction2.datasetFile.main(datasetFile.java:22)
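You should trim the token before passing it to the URI constructor, since it may carry leading or trailing whitespace such as a line break:
uri[i] = new URI(dataset.next().trim());
next() already returns a String, so no cast is needed before calling trim().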
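A minimal sketch of the trimming idea, assuming each row of the file has the form url,label on its own line (the question does not show the exact layout). The class name UrlDatasetSketch and the method parseRow are hypothetical. Stripping a leading byte order mark is an extra assumption: a BOM is one possible cause of an illegal character at index 0, and trim() does not remove it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;

public class UrlDatasetSketch {

    /** One parsed row: the URL and its label (0 = benign, 1 = malicious). */
    public static final class Row {
        public final URI uri;
        public final int label;

        Row(URI uri, int label) {
            this.uri = uri;
            this.label = label;
        }
    }

    /** Parses a single "url,label" line (assumed layout). */
    public static Row parseRow(String line) throws URISyntaxException {
        // Assumption: a byte order mark may precede the first URL in the file.
        if (line.startsWith("\uFEFF")) {
            line = line.substring(1);
        }
        String[] parts = line.split(",", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected url,label but got: " + line);
        }
        // trim() removes spaces, tabs and line breaks around each field.
        URI uri = new URI(parts[0].trim());
        int label = Integer.parseInt(parts[1].trim());
        return new Row(uri, label);
    }

    public static void main(String[] args) throws URISyntaxException {
        List<String> sample = Arrays.asList(
                " https://www.google.com , 0",
                "http://atualizacaodedados.online,1");
        for (String line : sample) {
            Row row = parseRow(line);
            System.out.println(row.uri + " -> " + row.label);
        }
    }
}
```

To read the real file, the lines could come from a BufferedReader over urltest.csv instead of the in-memory list used here.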
I'm working on code that imports two CSV files and then parses them:
//Importing CSV File for betreuen
String filename = "betreuen_4.csv";
File file = new File(filename);
//Importing CSV File for lieferant
String filename1 = "lieferant.csv";
File file1 = new File(filename1);
I then proceed to parse them. For the first csv file everything works fine. The code is
try {
Scanner inputStream = new Scanner(file);
while(inputStream.hasNext()) {
String data = inputStream.next();
String[] values = data.split(",");
int PInummer = Integer.parseInt(values[1]);
String MNummer = values[0];
String KundenID = values[2];
//System.out.println(MNummer);
//create the caring object with the required paramaters
//Caring caring = new Caring(MNummer,PInummer,KundenID);
//betreuen.add(caring);
}
inputStream.close();
}catch(FileNotFoundException d) {
d.printStackTrace();
}
I then proceed to parse the other CSV file. The code is
// parsing csv file lieferant
try {
Scanner inputStream1 = new Scanner(file1);
while(inputStream1.hasNext()) {
String data1 = inputStream1.next();
String[] values1 = data1.split(",");
int LIDnummer = Integer.parseInt(values1[0]);
String citynames = values1[1];
System.out.println(LIDnummer);
String firmanames = values1[2];
//create the suppliers object with the required paramaters
//Suppliers suppliers = new
//Suppliers(LIDnummer,citynames,firmanames);
//lieferant.add(suppliers);
}
inputStream1.close();
}catch(FileNotFoundException d) {
d.printStackTrace();
}
the first error I get is
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
at Verbindung.main(Verbindung.java:61)
So I looked at line 61, which reads firmanames from the array, and I thought it was impossible for the index to be out of range, since my CSV file has three columns and index 2 (which I know is the third column in the CSV file) holds my list of company names. I know the array is not empty because when I wrote
`System.out.println(firmanames)`
it would print out the first three company names. To see whether something else was causing the problem, I commented out line 61 and ran the code again. I got the following error:
`Exception in thread "main" java.lang.NumberFormatException: For input
string: "Ridge"
at java.lang.NumberFormatException.forInputString(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at Verbindung.main(Verbindung.java:58)`
I googled these errors, and the explanation was that I'm trying to parse something into an Integer that cannot be an integer, but the only thing I am parsing into an Integer is this line:
int LIDnummer = Integer.parseInt(values1[0]);
Which indeed is a column containing only Integers.
My second column is also indeed just a column of city names in the USA. The only thing about that column is that some town names contain spaces, like Middle brook, but I don't think that would cause problems for a String type. Also, my company column has names like AT&T, but I would think that the & symbol would not cause problems for a String either. I don't know where I am going wrong here.
I can't include the CSV file, but here is a picture of part of it. Each column has 1000 entries.
A pic of the csv file
Scanner by default splits its input by whitespace (docs). Whitespace means spaces, tabs and newlines.
So your code will, I think, split the whole input file at every space and every newline, which is not what you want.
So, the first three elements your code will read are
5416499,Prairie
Ridge,NIKE
1765368,Edison,Cartier
I suggest using method readLine of BufferedReader then calling split on that.
The alternative is to explicitly tell Scanner how you want it to split the input
Scanner inputStream1 = new Scanner(file1).useDelimiter("\n");
but I think this is not the best use of Scanner when a simpler class (BufferedReader) will do.
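A minimal sketch of the BufferedReader approach suggested above, assuming three comma-separated columns per line and no quoted fields. The class name LieferantReaderSketch and the method readRows are hypothetical, and the sample rows are reconstructed from the tokens listed earlier in this answer. Because readLine returns a whole line, the space in "Prairie Ridge" no longer splits the record.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LieferantReaderSketch {

    /** Reads the source line by line and splits each non-blank line on commas. */
    public static List<String[]> readRows(Reader source) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                rows.add(line.split(","));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for lieferant.csv; for a real file,
        // pass new FileReader("lieferant.csv") instead.
        String sample = "5416499,Prairie Ridge,NIKE\n1765368,Edison,Cartier\n";
        for (String[] values : readRows(new StringReader(sample))) {
            int lidNummer = Integer.parseInt(values[0].trim());
            String cityName = values[1];
            String firmaName = values[2];
            System.out.println(lidNummer + " | " + cityName + " | " + firmaName);
        }
    }
}
```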
First of all, I would highly suggest you try and use an existing CSV parser, for example this one.
But if you really want to use your own, you are going to need to do some simple debugging. I don't know how large your file is, but the symptoms you are describing lead me to believe that somewhere in the csv there may be a missing comma or an accidental escape character. You need to find out what line it is. So run this code and check its output before it crashes:
int line = 1;
try {
Scanner inputStream1 = new Scanner(file1);
while(inputStream1.hasNext()) {
String data1 = inputStream1.next();
String[] values1 = data1.split(",");
int LIDnummer = Integer.parseInt(values1[0]);
String citynames = values1[1];
System.out.println(LIDnummer);
String firmanames = values1[2];
line++;
}
} catch (ArrayIndexOutOfBoundsException e){
System.err.println("The issue in the csv is at line:" + line);
}
Once you find what line it is, the answer should be obvious. If not, post a picture of that line and we'll see...
I use Java and Apache POI to read .xlsx files (60k+ rows), but I get an error.
I use the latest versions of the poi and xmlbeans Maven dependencies.
According to the related questions I found on Stack Overflow, the latest POI should successfully process files containing special characters.
If it were an XML file, I could replace the special characters in my own program. But it's an Excel file.
The difficulty is that I have no idea how to read the Excel file successfully with POI.
Or is there any way to process the file?
I use openjdk, version: "1.8.0_171-1-redhat".
The error message looks like this:
Caused by: java.io.IOException: unable to parse shared strings table
at org.apache.poi.xssf.model.SharedStringsTable.readFrom(SharedStringsTable.java:134)
at org.apache.poi.xssf.model.SharedStringsTable.<init>(SharedStringsTable.java:111)
... 11 more
Caused by: org.apache.xmlbeans.XmlException: error: Character reference "�" is an invalid XML character.
at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3440)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1272)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1259)
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
at org.openxmlformats.schemas.spreadsheetml.x2006.main.SstDocument$Factory.parse(Unknown Source)
at org.apache.poi.xssf.model.SharedStringsTable.readFrom(SharedStringsTable.java:123)
the code
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.binary.Base64;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
public class test2 {
public static void main(String[] args) throws Exception {
File file = new File("D:\\Users\\3389\\Desktop\\Review\\drive-download-20181112T012605Z-001\\ticket.xlsx");
Workbook workbook = null;
XSSFWorkbook xssfWorkbook = new XSSFWorkbook(file); //error occurs here
workbook = new SXSSFWorkbook(xssfWorkbook);
Sheet sheet = xssfWorkbook.getSheetAt(0);
System.out.println("the first row:"+sheet.getFirstRowNum());
}
}
pom.xml
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>4.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>4.0.0</version>
</dependency>
UTF-16 surrogate pairs in sharedStrings.xml (several examples):
👍👍👍
👍
👍👍👍👍👍👍👍
etc....
Since your question title contains the question "Is there any way to preprocess the excel file?", I will try to answer that:
Assumption:
The /xl/sharedStrings.xml in the *.xlsx file contains UTF-16-surrogate-pair XML numeric character references like &#55357;&#56833; = 😁. This is OK for HTML. But it is not allowed in Office Open XML because there the encoding is always UTF-8 and the two surrogate characters are not allowed in that XML.
So if the /xl/sharedStrings.xml in the *.xlsx file contains UTF-16-surrogate-pair XML numeric character references, then the file is corrupt and should not be used anyway. The problem should be solved by those who created that *.xlsx file.
But if the file nevertheless needs repairing, this can only be done at string level. Parsing the XML is not possible because of the UTF-16-surrogate-pair XML numeric character references. So first get the /xl/sharedStrings.xml out of the *.xlsx file, then get the string content of that /xl/sharedStrings.xml file, then replace each UTF-16-surrogate-pair XML numeric character reference found with its Unicode replacement.
My code shows how to do this using java.util.regex.Matcher. It searches for entities matching the pattern &#(\\d{5});&#(\\d{5});. If found, it gets the high and low surrogates as integers. Then it checks whether these really form a surrogate pair (H must be between 0xD800 and 0xDBFF and L must be between 0xDC00 and 0xDFFF). If so, it calculates N as N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000. Then it replaces the UTF-16-surrogate-pair XML numeric character reference with a Unicode numeric character reference. After all that, it replaces leftover single halves of surrogate pairs with the empty string. So they are removed, since single halves of surrogate pairs are not allowed.
import java.io.*;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class XSSFWrongXMLinSharedStrings {
static String replaceUTF16SurrogatePairs(String string) {
Pattern pattern = Pattern.compile("&#(\\d{5});&#(\\d{5});");
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
String found = matcher.group();
int h = Integer.valueOf(matcher.group(1));
int l = Integer.valueOf(matcher.group(2));
if (0xD800 <= h && h <= 0xDBFF && 0xDC00 <= l && l <= 0xDFFF) {
int n = (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000;
System.out.print(found + " will be replaced with ");
System.out.println("&#" + n + ";");
string = string.replace(found, "&#" + n + ";");
}
}
pattern = Pattern.compile("&#(\\d{5});");
matcher = pattern.matcher(string);
while (matcher.find()) {
String found = matcher.group();
int n = Integer.valueOf(matcher.group(1));
if (0xD800 <= n && n <= 0xDFFF) {
System.out.println(found + " is a single half of a surrogate pair. It will be removed.");
string = string.replace(found, "");
}
}
return string;
}
public static void main(String[] args) throws Exception {
File file = new File("ticket.xlsx");
//Repairing the /xl/sharedStrings.xml on string level. Parsing XML is not possible because of the UTF-16-surrogate-pair XML numeric character references.
OPCPackage opcPackage = OPCPackage.open(file);
PackagePart packagePart = opcPackage.getPartsByName(Pattern.compile("/xl/sharedStrings.xml")).get(0);
ByteArrayOutputStream sharedStringsBytes = new ByteArrayOutputStream();
byte[] buffer = new byte[1024];
int length;
InputStream inputStream = packagePart.getInputStream();
while ((length = inputStream.read(buffer)) != -1) {
sharedStringsBytes.write(buffer, 0, length);
}
inputStream.close();
String sharedStrings = sharedStringsBytes.toString("UTF-8");
//Replace UTF-16-surrogate-pair XML numeric character reference with its Unicode replacement:
//sharedStrings = sharedStrings.replace("&#55357;&#56833;", "&#128513;");
//ToDo: Create method for replacing all possible UTF-16-surrogate-pair XML numeric character references with their unicode replacements.
sharedStrings = replaceUTF16SurrogatePairs(sharedStrings);
OutputStream outputStream = packagePart.getOutputStream();
outputStream.write(sharedStrings.getBytes("UTF-8"));
outputStream.flush();
outputStream.close();
opcPackage.close();
//Now the /xl/sharedStrings.xml in the file does not contain UTF-16-surrogate-pair XML numeric character references any more.
Workbook workbook = new XSSFWorkbook(file);
Sheet sheet = workbook.getSheetAt(0);
System.out.println("Success.");
}
}
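The surrogate-pair arithmetic used in replaceUTF16SurrogatePairs can be checked in isolation. The sketch below is only an illustration: the class name SurrogatePairSketch and the method combine are hypothetical, and the values 55357 and 56833 are the two UTF-16 surrogates of U+1F601 (😁), chosen as an example. It compares the hand-computed result with the JDK's own Character.toCodePoint.

```java
public class SurrogatePairSketch {

    /** Combines a high and a low surrogate into a single Unicode code point. */
    public static int combine(int high, int low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("Not a valid surrogate pair: " + high + ", " + low);
        }
        return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        int high = 55357; // 0xD83D
        int low = 56833;  // 0xDE01
        int codePoint = combine(high, low);

        // (0xD83D - 0xD800) * 0x400 + (0xDE01 - 0xDC00) + 0x10000 = 0x1F601
        System.out.println(codePoint);                      // prints 128513
        System.out.println(Integer.toHexString(codePoint)); // prints 1f601

        // Cross-check against the JDK's own conversion.
        System.out.println(codePoint == Character.toCodePoint((char) high, (char) low)); // prints true
    }
}
```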
I'm using univocity 2.7.5 to parse a CSV file. Until now it worked fine and parsed each row of the CSV file into a String array with n elements, where n = the number of columns in the row. But now I have a file where rows start with a quote ("), and the parser cannot handle it. It returns each row as a String array with only one element, which contains the whole row's data. I tried removing that quote from the CSV file and it worked fine, but there are about 500,000 rows. What should I do to make it work?
Here is the sample line from my file (it has quotes in source file too):
"100926653937,Kasym Amina,620414400630,Marzhan Erbolova,""Kazakhstan, Almaty, 66, 3"",87029845662"
And here's my code:
CsvParserSettings settings = new CsvParserSettings();
settings.setDelimiterDetectionEnabled(true);
CsvParser parser = new CsvParser(settings);
List<String[]> rows = parser.parseAll(csvFile);
Author of the library here. The input you have there is a well-formed CSV, with a single value consisting of:
100926653937,Kasym Amina,620414400630,Marzhan Erbolova,"Kazakhstan, Almaty, 66, 3",87029845662
If that row appeared in the middle of your input, I suppose your input has unescaped quotes (somewhere before you got to that line). Try playing with the unescaped quote handling setting:
For example, this might work:
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE);
If nothing works, and all your lines look like the one you posted, then you can parse the input twice (which is shitty and slow but will work):
CsvParser parser = new CsvParser(settings);
parser.beginParsing(csvFile);
List<String[]> out = new ArrayList<>();
String[] row;
while ((row = parser.parseNext()) != null) {
//got a row with unexpected length?
if(row.length == 1){
//break it down again.
row = parser.parseLine(row[0]);
}
out.add(row);
}
Hope this helps.
I am trying to read the xlsx file using the code below, but I am getting an "unable to resolve class" error. Any help would be appreciated.
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed: Script7.groovy: 8: unable to resolve class XSSFWorkbook @ line 8, column 11. srcBook = new XSSFWorkbook(new FileInputStream(new File("C:\\PerTableData\\TestData-Mix.xlsx"))) ^ org.codehaus.groovy.syntax.SyntaxException: unable
import org.apache.poi.ss.usermodel.*;
//import org.apache.poi.ss.usermodel.DataFormatter
//Create data formatter
//dFormatter = new DataFormatter()
//Create a new workbook using POI API
srcBook = new XSSFWorkbook(new FileInputStream(new File("C:\\PerTableData\\TestData-Mix.xlsx")))
//Create formula evaluator to handle formula cells
fEval = new XSSFFormulaEvaluator(srcBook)
//Get first sheet of the workbook (assumes data is on first sheet)
sourceSheet = srcBook.getSheetAt(0)
//Sets row counter to 0 (first row)-- if your sheet has headers, you can set this to 1
context.rowCounter = 0
//Read in the contents of the first row
sourceRow = sourceSheet.getRow(0)
//Step through cells in the row and populate property values-- note the extra work for numbers
elNameCell = sourceRow.getCell(0)
testCase.setPropertyValue("ElName",dFormatter.formatCellValue(elNameCell,fEval))
atNumCell = sourceRow.getCell(1)
testCase.setPropertyValue("AtNum",dFormatter.formatCellValue(atNumCell,fEval))
symbolCell = sourceRow.getCell(2)
testCase.setPropertyValue("Symbol",dFormatter.formatCellValue(symbolCell,fEval))
atWtCell = sourceRow.getCell(3)
testCase.setPropertyValue("AtWeight",dFormatter.formatCellValue(atWtCell,fEval))
boilCell = sourceRow.getCell(4)
testCase.setPropertyValue("BoilPoint",dFormatter.formatCellValue(boilCell,fEval))
//Rename request test steps for readability in the log; append the element name to the test step names
testCase.getTestStepAt(0).setName("GetAtomicNumber-" + testCase.getPropertyValue("AtNum"))
testCase.getTestStepAt(1).setName("GetAtomicWeight-" + testCase.getPropertyValue("AtWeight"))
testCase.getTestStepAt(2).setName("GetElementySymbol-" + testCase.getPropertyValue("Symbol"))
//Add references to sheet to re-use it in ReadNextLine step
context.srcWkSheet = sourceSheet
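Add the import:
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
You have only imported the ss package (note xssf versus ss):
import org.apache.poi.ss.usermodel.*;
XSSFFormulaEvaluator lives in the same org.apache.poi.xssf.usermodel package, so it needs an import as well.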
Hi, I have a small problem and I think I'm just not getting the syntax right on one line of code. Basically, I can write into my CSV file and find a specific record using StringTokenizer, but the specified cells of that record are not updated/edited; the record remains the same. Please help.
I used http://opencsv.sourceforge.net in Java.
Hi,
This is the code to update a CSV file by specifying row and column:
/**
* Update CSV by row and column
*
* @param fileToUpdate CSV file path to update e.g. D:\\chetan\\test.csv
* @param replace Replacement for your cell value
* @param row Row for which need to update
* @param col Column for which you need to update
* @throws IOException
*/
public static void updateCSV(String fileToUpdate, String replace,
int row, int col) throws IOException {
File inputFile = new File(fileToUpdate);
// Read existing file
CSVReader reader = new CSVReader(new FileReader(inputFile), ',');
List<String[]> csvBody = reader.readAll();
// get CSV row column and replace with by using row and column
csvBody.get(row)[col] = replace;
reader.close();
// Write to CSV file which is open
CSVWriter writer = new CSVWriter(new FileWriter(inputFile), ',');
writer.writeAll(csvBody);
writer.flush();
writer.close();
}
This solution worked for me,
Cheers!
I used the below code where I will replace a string with another and it worked exactly the way I needed:
public static void updateCSV(String fileToUpdate) throws IOException {
File inputFile = new File(fileToUpdate);
// Read existing file
CSVReader reader = new CSVReader(new FileReader(inputFile), ',');
List<String[]> csvBody = reader.readAll();
// scan every cell and replace each value that matches the search string
for(int i=0; i<csvBody.size(); i++){
String[] strArray = csvBody.get(i);
for(int j=0; j<strArray.length; j++){
if(strArray[j].equalsIgnoreCase("Update_date")){ //String to be replaced
csvBody.get(i)[j] = "Updated_date"; //Target replacement
}
}
}
reader.close();
// Write to CSV file which is open
CSVWriter writer = new CSVWriter(new FileWriter(inputFile), ',');
writer.writeAll(csvBody);
writer.flush();
writer.close();
}
You're doing something like this:
String line = readLineFromFile();
line.replace(...);
This is not editing the file; it's creating a new string from a line in the file.
String instances are immutable, so the replace call you're making returns a new string; it does not modify the original string.
Either use a file stream that allows you to both read and write to the file, i.e. RandomAccessFile, or (more simply) write to a new file and then replace the old file with the new one.
In pseudocode:
for (String line : inputFile) {
String [] processedLine = processLine(line);
outputFile.writeLine(join(processedLine, ","));
}
private String[] processLine(String line) {
String [] cells = line.split(","); // note this is not sufficient for correct csv parsing.
for (int i = 0; i < cells.length; i++) {
if (wantToEditCell(cells[i])) {
cells[i] = "new cell value";
}
}
return cells;
}
Also, please take a look at this question. There are libraries to help you deal with csv.
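The "write to a new file, then replace the old one" idea can be sketched in runnable form. This is only an illustration under stated assumptions: the class name CsvCellUpdateSketch and the method updateCell are hypothetical, the whole file is read into memory for brevity, and the plain split on commas is still not sufficient for CSV with quoted fields.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

public class CsvCellUpdateSketch {

    /** Rewrites one cell by writing a temporary file and moving it over the original. */
    public static void updateCell(Path csv, int row, int col, String replacement) throws IOException {
        List<String> lines = Files.readAllLines(csv, StandardCharsets.UTF_8);

        // Strings are immutable, so build the changed line and store it back.
        String[] cells = lines.get(row).split(",", -1); // -1 keeps trailing empty cells
        cells[col] = replacement;
        lines.set(row, String.join(",", cells));

        // Write next to the original, then replace the original in one move.
        Path temp = Files.createTempFile(csv.toAbsolutePath().getParent(), "csv-update", ".tmp");
        Files.write(temp, lines, StandardCharsets.UTF_8);
        Files.move(temp, csv, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Throwaway sample file so the sketch runs on its own.
        Path csv = Files.createTempFile("records", ".csv");
        Files.write(csv, Arrays.asList("id,name,status", "1,Alice,open", "2,Bob,open"),
                StandardCharsets.UTF_8);

        updateCell(csv, 2, 2, "closed");

        for (String line : Files.readAllLines(csv, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
        Files.delete(csv);
    }
}
```

For large files, the same idea works with a BufferedReader and BufferedWriter that process one line at a time instead of loading every line first.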
A CSV file is just a file. Reading it does not change it.
So, write your changes!
You have 3 ways.
1
Read line by line to find the cell you want to change.
Change the cell if needed and compose a new version of the current line.
Write the line into a second file.
When you have finished, you have the source file and the result file. Now, if you want, you can remove the source file and rename the result file to the source file's name.
2
Use RandomAccessFile to write into a specific place in the file.
3
Use one of the available implementations of a CSV parser (e.g. http://commons.apache.org/sandbox/csv/)
It already supports what you need and exposes high level API.
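A hedged sketch of option 2, assuming the byte offset of the cell is already known. The class name InPlaceOverwriteSketch and the method overwrite are hypothetical. Overwriting in place only works when the replacement has exactly the same byte length as the old value; otherwise the rest of the file would have to be shifted, and option 1 is simpler.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class InPlaceOverwriteSketch {

    /** Overwrites oldText at the given byte offset with newText of equal byte length. */
    public static void overwrite(Path file, long offset, String oldText, String newText)
            throws IOException {
        byte[] oldBytes = oldText.getBytes(StandardCharsets.UTF_8);
        byte[] newBytes = newText.getBytes(StandardCharsets.UTF_8);
        if (oldBytes.length != newBytes.length) {
            throw new IllegalArgumentException("Replacement must have the same byte length");
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            // Verify that the expected value really sits at this offset.
            raf.seek(offset);
            byte[] current = new byte[oldBytes.length];
            raf.readFully(current);
            if (!Arrays.equals(current, oldBytes)) {
                throw new IllegalStateException("Unexpected content at offset " + offset);
            }
            // Seek back and overwrite the bytes in place.
            raf.seek(offset);
            raf.write(newBytes);
        }
    }

    public static void main(String[] args) throws IOException {
        Path csv = Files.createTempFile("records", ".csv");
        Files.write(csv, "1,open\n2,open\n".getBytes(StandardCharsets.UTF_8));

        // "1,open\n" is 7 bytes and "2," is 2 more, so the second "open" starts at byte 9.
        overwrite(csv, 9, "open", "done");

        System.out.print(new String(Files.readAllBytes(csv), StandardCharsets.UTF_8));
        Files.delete(csv);
    }
}
```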