ZWNBSP appears when parsing CSV - java

I have a CSV and I want to check that it has all the data it should have. But it looks like a ZWNBSP appears at the beginning of the first column name in the first row.
My simplified code is
@Test
void parseCsvTest() throws Exception {
    Configuration.holdBrowserOpen = true;
    ClassLoader classLoader = getClass().getClassLoader();
    try (
        InputStream inputStream = classLoader.getResourceAsStream("files/csv_example.csv");
        CSVReader reader = new CSVReader(new InputStreamReader(inputStream))
    ) {
        List<String[]> content = reader.readAll();
        var csvStrings0line = content.get(0);
        var csv1stElement = csvStrings0line[0];
        var csv1stElementShouldBe = "Timestamp";
        assertEquals(csv1stElementShouldBe, csv1stElement);
    }
}
My CSV contains
"Timestamp","Source","EventName","CountryId","Platform","AppVersion","DeviceType","OsVersion"
"2022-05-02T14:56:59.536987Z","courierapp","order_delivered_sent","643","ios","3.11.0","iPhone 11","15.4.1"
"2022-05-02T14:57:35.849328Z","courierapp","order_delivered_sent","643","ios","3.11.0","iPhone 8","15.3.1"
My test fails with
expected: <Timestamp> but was: <Timestamp>
Expected :Timestamp
Actual :Timestamp
<Click to see difference>
Clicking on the see difference shows that there is a ZWNBSP at the beginning of the Actual text.
Copy-pasting my text into the online tool for displaying non-printable Unicode characters (https://www.soscisurvey.de/tools/view-chars.php) shows only CR LF at the ends of the lines, and no ZWNBSPs.
But where does it come from?

It's a BOM character. You may remove it yourself or use one of several other solutions (see https://stackoverflow.com/a/4897993/1420794 for instance).
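For instance, a minimal sketch of stripping the BOM before it reaches the CSV reader, assuming the file is UTF-8 and Apache Commons IO (org.apache.commons.io.input.BOMInputStream) is on the classpath; it reuses the resource path from the question:

try (
    InputStream inputStream = classLoader.getResourceAsStream("files/csv_example.csv");
    // BOMInputStream silently skips a leading UTF-8 BOM, if one is present
    CSVReader reader = new CSVReader(new InputStreamReader(
            new BOMInputStream(inputStream), StandardCharsets.UTF_8))
) {
    List<String[]> content = reader.readAll();
}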

That is the Unicode zero-width no-break space character. When it appears at the beginning of a Unicode-encoded text file, it serves as a 'byte-order mark'. You read it to determine the encoding of the text file, then you can safely discard it if you want. The best thing you can do is spread awareness.

Related

Encoding for unicode and & characters

I am trying to save the below string to my protobuf model:
STOXX®Europe 600 Food&BevNR ETF
But when printing the proto model value, it's displayed like:
STOXXÂ®Europe 600 Food&amp;BevNR ETF
I tried to encode the string to UTF-8 and also tried StringEscapeUtils.unescapeJava(str), but both failed. I'm getting this string by parsing the XML response from the server. Any ideas?
Ref: XML parser Skip invalid xml element with XmlStreamReader
Correcting the XML parsing is better than needing to unescape everything afterwards. Here is a test case showing this:
public static void main(String[] args) throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    factory.setProperty("javax.xml.stream.isCoalescing", true);
    // The raw XML carries the ampersand as the entity &amp;
    ReaderInputStream ris = new ReaderInputStream(
            new StringReader("<tag>STOXX®Europe 600 Food&amp;BevNR ETF</tag>"),
            StandardCharsets.UTF_8);
    XMLStreamReader reader = factory.createXMLStreamReader(ris, "UTF-8");
    StringBuilder sb = new StringBuilder();
    while (reader.hasNext()) {
        reader.next();
        if (reader.hasText())
            sb.append(reader.getText());
    }
    System.out.println(sb);
}
Output:
STOXX®Europe 600 Food&BevNR ETF
Actually, I have a protobuf method to solve this issue:
ByteString.copyFrom(StringEscapeUtils.unescapeHtml3(string), "ISO-8859-1").toStringUtf8();
Documentation of ByteString
As the text comes from XML use:
s = StringEscapeUtils.unescapeXml(s);
This is way better than unescaping HTML which has hundreds of named entities &...;.
The two rubbish characters instead of the copyright symbol are due to reading UTF-8 encoded text (multi-byte for special characters) as some single-byte encoding, probably Latin-1.
This wrong conversion might be repaired with another conversion, but it would be best to read the input as UTF-8 in the first place.
// Hack, just patching. Assumes Latin-1 encoding
s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
// Or maybe:
s = new String(s.getBytes(), StandardCharsets.UTF_8);
Better to inspect the reading code, and look whether an optional encoding went missing: InputStreamReader, OutputStreamWriter, new String, getBytes.
Your entire problem would be solved by using an XML reader too.

CSVParser processes LF as CRLF

I am trying to parse a CSV file as below
String NEW_LINE_SEPARATOR = "\r\n";
CSVFormat csvFileFormat = CSVFormat.DEFAULT.withRecordSeparator(NEW_LINE_SEPARATOR);
FileReader fr = new FileReader("201404051539.csv");
CSVParser csvParser = csvFileFormat.withHeader().parse(fr);
List<CSVRecord> recordsList = csvParser.getRecords();
Now, the file has normal lines ending with CRLF characters; however, a few lines have an additional LF character appearing in the middle.
i.e.
a,b,c,dCRLF --line1
e,fLF,g,h,iCRLF --line2
Due to this, the parse operation creates three records, whereas there are actually only two.
Is there a way I can have the LF character appearing in the middle of the second line not treated as a line break, and get only two records upon parsing?
Thanks
I think uniVocity-parsers is the only parser you will find that will work with line endings as you expect.
The equivalent code using univocity-parsers will be:
CsvParserSettings settings = new CsvParserSettings(); //many options here, check the tutorial
settings.getFormat().setLineSeparator("\r\n");
settings.getFormat().setNormalizedNewline('\u0001'); //uses a special character to represent a new record instead of \n.
settings.setNormalizeLineEndingsWithinQuotes(false); //does not replace \r\n by the normalized new line when reading quoted values.
settings.setHeaderExtractionEnabled(true); //extract headers from file
settings.trimValues(false); //does not remove whitespaces around values
CsvParser parser = new CsvParser(settings);
List<Record> recordsList = parser.parseAllRecords(new File("201404051539.csv"));
If you define a line separator to be \r\n then this is the ONLY sequence of characters that should identify a new record (when outside quotes). All values can have either \r or \n without being enclosed in quotes because that's NOT the line separator sequence.
When parsing the input sample you gave:
String input = "a,b,c,d\r\ne,f\n,g,h,i\r\n";
parser.parseAll(new StringReader(input));
The result will be:
LINE1 = [a, b, c, d]
LINE2 = [e, f
, g, h, i]
Disclosure: I'm the author of this library. It's open-source and free (Apache 2.0 license)

How to print pretty JSON using docx4j into a word document?

I want to print a simple pretty JSON string (containing multiple line breaks, i.e. many \n) into a Word document. I tried the following, but docx4j just prints all the content inline on one single line (without the \n). Ideally it should print the multiline pretty JSON as-is, recognising the \n characters the JSON string contains:
1)
wordMLPackage.getMainDocumentPart().addParagraphOfText({multiline pretty json String})
2)
ObjectFactory factory = Context.getWmlObjectFactory();
P p = factory.createP();
Text t = factory.createText();
t.setValue(text);
R run = factory.createR();
run.getContent().add(t);
p.getContent().add(run);
PPr ppr = factory.createPPr();
p.setPPr(ppr);
ParaRPr paraRpr = factory.createParaRPr();
ppr.setRPr(paraRpr);
wordMLPackage.getMainDocumentPart().addObject(p);
Looking for help. Thanks.
The docx file format doesn't treat \n as a newline.
So you'll need to split your string on \n, and either create a new P for each line, or use w:br, like so:
Br br = wmlObjectFactory.createBr();
run.getContent().add(br);
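Putting it together, a sketch of the w:br approach, splitting on \n and inserting a break between the segments. It assumes wordMLPackage is an initialized WordprocessingMLPackage and prettyJson (a made-up name) holds the JSON string:

ObjectFactory factory = Context.getWmlObjectFactory();
P p = factory.createP();
R run = factory.createR();
String[] lines = prettyJson.split("\n", -1);
for (int i = 0; i < lines.length; i++) {
    Text t = factory.createText();
    t.setValue(lines[i]);
    t.setSpace("preserve"); // keep the leading spaces of indented JSON
    run.getContent().add(t);
    if (i < lines.length - 1) {
        run.getContent().add(factory.createBr()); // w:br renders as a line break
    }
}
p.getContent().add(run);
wordMLPackage.getMainDocumentPart().addObject(p);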

DbUnit NoSuchTableException - Workaround for long table names in Oracle

I'm working on creating a test suite that runs on multiple databases using DbUnit XML. Unfortunately, yesterday I discovered that some table names in our schema are over 30 characters, and are truncated for Oracle. For example, a table named unusually_long_table_name_error in MySQL is named unusually_long_table_name_erro in Oracle. This means that my DbUnit file contains lines like <unusually_long_table_name_error col1="value1" col2="value2" />. These lines throw a NoSuchTableException when running the tests in Oracle.
Is there a programmatic workaround for this? I'd really like to avoid generating special XML files for Oracle. I looked into a custom MetadataHandler, but it returns lots of java.sql datatypes that I don't know how to intercept/spoof. I could read the XML myself, truncate each table name to 30 characters, write that out to a temp file or StringBufferInputStream and then use that as input to my DataSetBuilder, but that seems like a whole lot of steps to accomplish very little. Maybe there's some ninja Oracle trick with synonyms or stored procedures or goodness-knows-what-else that could help me. Is one of these ideas clearly better than the others? Is there some other approach that would blow me away with its simplicity and elegance? Thanks!
In light of the lack of answers, I ended up going with my own suggested approach, which:
Reads the .xml file
Regex's out the table name
Truncates the table name if it's over 30 characters
Appends the (potentially modified) line to a StringBuilder
Feeds that StringBuilder into a ByteArrayInputStream, suitable for passing into a DataSetBuilder
public InputStream oracleWorkaroundStream(String fileName) throws IOException
{
    String ls = System.getProperty("line.separator");
    // This pattern isolates the table name from the rest of the line
    Pattern pattern = Pattern.compile("(\\s*<)(\\w+)(.*/>)");
    FileInputStream fis = new FileInputStream(fileName);
    // Use a StringBuilder for better performance over repeated concatenation
    StringBuilder sb = new StringBuilder(fis.available() * 2);
    try (BufferedReader buff = new BufferedReader(new InputStreamReader(fis, "UTF-8")))
    {
        while (buff.ready())
        {
            // Read a line from the source xml file
            String line = buff.readLine();
            Matcher matcher = pattern.matcher(line);
            // See if the line contains a table name
            if (matcher.matches())
            {
                String tableName = matcher.group(2);
                if (tableName.length() > 30)
                {
                    tableName = tableName.substring(0, 30);
                }
                // Append the (potentially modified) line
                sb.append(matcher.group(1));
                sb.append(tableName);
                sb.append(matcher.group(3));
            }
            else
            {
                // Some lines don't have table names (<dataset>, <?xml?>, etc.)
                sb.append(line);
            }
            sb.append(ls);
        }
    }
    return new ByteArrayInputStream(sb.toString().getBytes("UTF-8"));
}
EDIT: Switched to a StringBuilder from repeated String concatenation, which gives a huge performance boost.
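For reference, a hypothetical usage sketch (the file name is made up) feeding the stream into DbUnit's FlatXmlDataSetBuilder:

// Hypothetical usage: build a DbUnit data set from the rewritten stream
IDataSet dataSet = new FlatXmlDataSetBuilder().build(oracleWorkaroundStream("dataset.xml"));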

Why is Java BufferedReader() not reading Arabic and Chinese characters correctly?

I'm trying to read a file which contains English & Arabic characters on each line, and another file which contains English & Chinese characters on each line. However, the Arabic and Chinese characters fail to show correctly; they just appear as question marks. Any idea how I can solve this problem?
Here is the code I use for reading:
try {
    String sCurrentLine;
    BufferedReader br = new BufferedReader(new FileReader(directionOfTargetFile));
    int counter = 0;
    while ((sCurrentLine = br.readLine()) != null) {
        String lineFixedHolder = converter.fixParsedParagraph(sCurrentLine);
        System.out.println("The line number " + counter
                + " contains: " + sCurrentLine);
        counter++;
    }
} catch (IOException e) {
    e.printStackTrace();
}
Edit 01
After reading a line and getting the Arabic or Chinese word, I use a function to translate it by searching for the given Arabic text in an ArrayList (which contains all expected words), using the indexOf() method. When the word's index is found, it is used to fetch the English word at the same index in another ArrayList. However, this search always fails because it searches for the question marks instead of the Arabic or Chinese characters. So my System.out.println shows nulls, one for each failed translation.
I'm using the NetBeans 6.8 IDE (Mac version).
Edit 02
Here is the code which searches for the translation:
int testColor = dbColorArb.indexOf(wordToTranslate);
int testBrand = -1;
if (testColor != -1) {
    String result = (String) dbColorEng.get(testColor);
    return result;
} else {
    testBrand = dbBrandArb.indexOf(wordToTranslate);
}
//System.out.println("The testBrand is : " + testBrand);
if (testBrand != -1) {
    String result = (String) dbBrandEng.get(testBrand);
    return result;
} else {
    //System.out.println("The first null");
    return null;
}
I'm actually searching two ArrayLists which might contain the desired word to translate. If the word is not found in either ArrayList, then null is returned.
Edit 03
When I debug, I find that the lines being read are stored in my String variable as follows:
"3;0000000000;0000001001;1996-06-22;;2010-01-27;����;;01989;������;"
Edit 04
The file I'm reading was given to me after being modified by another program (which I know nothing about, besides that it's made in VB); that program made the Arabic letters that were not appearing correctly appear. When I check the encoding of the file in Notepad++, it shows ANSI. However, when I convert it to UTF-8 (which replaces the Arabic letters with other English ones) and then convert it back to ANSI, the Arabic becomes question marks!
FileReader javadoc:
Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
So:
Reader reader = new InputStreamReader(new FileInputStream(fileName), "utf-8");
BufferedReader br = new BufferedReader(reader);
If this still doesn't work, then perhaps your console is not set to properly display UTF-8 characters. Configuration depends on the IDE used and is rather simple.
Update: in the above code, replace utf-8 with cp1256. This works fine for me (WinXP, JDK 6).
But I'd recommend that you insist on the file being generated using UTF-8. Because cp1256 won't work for Chinese and you'll have similar problems again.
It is most likely reading the information in correctly; however, your output stream is probably not UTF-8, so any character that cannot be shown in your output character set is being replaced with '?'.
You can confirm this by getting each character out and printing the character ordinal.
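For example, a quick sketch (reusing the sCurrentLine variable from the question) that prints Unicode code points instead of glyphs, so the console encoding cannot mask what was actually read:

// Print each code point; mis-decoded characters show up as U+FFFD (the replacement character)
for (int i = 0; i < sCurrentLine.length(); ) {
    int cp = sCurrentLine.codePointAt(i);
    System.out.printf("U+%04X ", cp);
    i += Character.charCount(cp);
}
System.out.println();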
public void writeTiFile(String fileName, String str) {
    // try-with-resources closes the stream even on failure
    try (FileOutputStream out = new FileOutputStream(fileName)) {
        out.write(str.getBytes("windows-1256"));
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
