Removing special characters in dynamic schema - java

Requirement:
Source file:
abc,''test,data'',valid
xyz,''sample,data'',invalid
The data in the source file has to be read dynamically. We read each entire line into a single string column. The file delimiter is a comma, and one of the values also contains a comma. I have to load the data into the target table as follows, without the double quotes:
Target table :
Col1|Col2|Col3
abc|test,data|valid
xyz|sample,data|invalid
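
One possible way to do this in plain Java is to walk each line character by character and ignore delimiters that fall between two quote tokens. This is a minimal sketch, not a definitive implementation: the class name QuotedLineSplitter and the method split are hypothetical, and it assumes the quote token is exactly what the sample shows (two single-quote characters) and that quotes are never escaped inside a value.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedLineSplitter {

    // Splits one line on `delimiter`, ignoring delimiters that appear between
    // two occurrences of `quote`. The quote token itself is dropped.
    // Assumption: `quote` is non-empty and never appears escaped inside a value.
    public static List<String> split(String line, char delimiter, String quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int i = 0;
        while (i < line.length()) {
            if (line.startsWith(quote, i)) {
                inQuotes = !inQuotes;          // toggle state and skip the quote token
                i += quote.length();
            } else if (line.charAt(i) == delimiter && !inQuotes) {
                fields.add(current.toString()); // delimiter outside quotes ends a field
                current.setLength(0);
                i++;
            } else {
                current.append(line.charAt(i));
                i++;
            }
        }
        fields.add(current.toString());         // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        String[] source = { "abc,''test,data'',valid", "xyz,''sample,data'',invalid" };
        for (String line : source) {
            System.out.println(String.join("|", split(line, ',', "''")));
        }
    }
}
```

If the real file uses a single double-quote character instead, the same method can be called with "\"" as the quote token.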

Related

Validate the column separator from a CSV file

I want to know if there is a way to validate the column separator of a CSV file. I'm using the CsvMapper library, and I know that I can set the column separator using .withColumnSeparator(). I expect the '|' character as the separator, but sometimes the file arrives with a ';' or some other separator, so I want to validate that the file really uses '|'.
E.g., I have this CSV file with these two lines:
A|B|3|e
w|ew|34|w
This is a valid file because it uses the '|' character as its separator.
But sometimes I can receive a CSV file like this:
A;B;3;e
w;ew;34;w
This file is separated by the ';' character and not by '|', which is why I need to validate the column separator.
Thanks a lot.
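
A simple pre-check, independent of CsvMapper, is to count the expected separator on every line before parsing. The sketch below is only one possible heuristic: the class SeparatorCheck and the method usesSeparator are hypothetical names, and it assumes that every row has the same number of columns and that the separator never appears inside a quoted value.

```java
import java.util.List;

public class SeparatorCheck {

    // Returns true when every non-empty line contains the expected separator
    // at least once and the same number of times as the first non-empty line.
    // Assumption: separators inside quoted values do not occur.
    public static boolean usesSeparator(List<String> lines, char expected) {
        long expectedCount = -1;
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            long count = line.chars().filter(c -> c == expected).count();
            if (count == 0) {
                return false;                 // separator missing on this line
            }
            if (expectedCount == -1) {
                expectedCount = count;        // first line fixes the column count
            } else if (count != expectedCount) {
                return false;                 // inconsistent number of columns
            }
        }
        return expectedCount > 0;             // false for empty input
    }
}
```

With the two sample files above, the first one passes the check for '|' and the second one fails it, so the file can be rejected before it is handed to the mapper.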

Read a text File which has numeric columns (like -10.0, -9.9, +9.9 etc.) through Apache Flink

I have a requirement to read a file that is generated by another application. The file has 201 numeric column names: -10.0, -9.9, -9.8, -9.7, ..., 0, ..., +9.7, +9.8, +9.9, +10.0. I already read many files through Flink, but those files have string column names, and for them I create a model object whose attributes match the column names in the file, as below:
DataSet<Person> csvInput = env.readCsvFile("file:///path/to/my/textfile")
.pojoType(Person.class, "name", "age", "zipcode");
The code above reads the file and populates Person objects with the values available in the file.
The challenge with the new requirement is that the column names are numeric, and in Java I cannot declare a variable whose name is a signed decimal number such as -10.0.
For example, private String -10.0 is not allowed in Java.
I am looking for a solution. Could anyone please help me out here?
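
One way around the naming problem is to avoid one field per column altogether: keep each row's values in an array of 201 numbers and translate a column label into an array position. The following is a sketch under stated assumptions, not a Flink-specific answer: the class ColumnIndex and the method indexOf are hypothetical, the labels are assumed to run from -10.0 to +10.0 in steps of 0.1, and how the array is filled from the Flink input is left open.

```java
public class ColumnIndex {

    // Maps a header label such as "-10.0", "0" or "+9.9" to a position 0..200.
    // Assumption: labels run from -10.0 to +10.0 in steps of 0.1.
    public static int indexOf(String label) {
        double value = Double.parseDouble(label.trim()); // accepts a leading '+' or '-'
        long index = Math.round((value + 10.0) * 10.0);  // rounding absorbs floating-point error
        if (index < 0 || index > 200) {
            throw new IllegalArgumentException("Label out of range: " + label);
        }
        return (int) index;
    }
}
```

A row could then be held as a double[201], and the value for the column labelled "-9.9" read as row[ColumnIndex.indexOf("-9.9")].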

Generating unique IRI from a filename

I have an ontology, created using Protégé 4.3.0, and I want to use the OWL-API to add some OWLNamedIndividual objects to an OWL file. I use the following instruction to create a new OWLNamedIndividual:
OWLNamedIndividual objSample = df.getOWLNamedIndividual(IRI.create(iri + "#" + id));
the variable id is a String;
iri is the base IRI of the loaded ontology; in order to get the base IRI of the ontology, I used the following instruction: iri = ontology.getOntologyID().getOntologyIRI().
So the new OWLNamedIndividual is added to the loaded ontology and then the ontology is saved to OWL file using the following instruction.
XMLWriterPreferences.getInstance().setUseNamespaceEntities(true);
OWLOntologyFormat format = manager.getOntologyFormat(ontology);
manager.saveOntology(ontology, format, IRI.create(file.toURI()));
The variable id is a String generated from the base name of a file (i.e. the file name without the extension). If the base name contains one or more spaces, the ontology is saved without any error, but when I open the newly saved OWL file, Protégé reports a parsing error at the first occurrence of an IRI containing spaces.
How could I create a valid IRI for an OWLNamedIndividual object using the base IRI of loaded ontology and the base name of a file?
An IRI is supposed to be a single block that identifies your resource. If I understand you correctly, you have an id such as big boat and you are creating IRIs that look like <http://example.com#big boat>. This is not a valid IRI; you need to replace the space with an _ or a -, so that you have <http://example.com#big_boat>. Even if you enter a modelling element name with a space in Protégé, it automatically puts a _ in the middle.
Take a look at this article for the invalid characters in an IRI.
Systems accepting IRIs MAY also deal with the printable characters in
US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
"{", "}", "|", "\", "^", and "`", in step 2 above. If these
characters are found but are not converted, then the conversion
SHOULD fail.
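
The replacement described above can be sketched as a small helper that is applied to the file's base name before it is passed to IRI.create. The class IriNames and the method toIriFragment are hypothetical names, and the set of replaced characters is taken from the quoted passage plus whitespace; it is an assumption that no other characters in the file names need special handling.

```java
public class IriNames {

    // Replaces each run of whitespace, or of the characters
    // < > " { } | \ ^ ` with a single underscore.
    // Leading and trailing whitespace is removed first.
    public static String toIriFragment(String baseName) {
        return baseName.trim().replaceAll("[\\s<>\"{}|\\\\^`]+", "_");
    }
}
```

The individual would then be created with IRI.create(iri + "#" + IriNames.toIriFragment(id)). Note that two different file names, such as big boat and big_boat, map to the same fragment with this approach, so it only gives unique IRIs if such pairs do not occur.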

Apache Pig process CSV with fields wrapped in quotes

How can I process a CSV file where some fields are wrapped in quotes?
Line to process for example (field delimiter is ',')
I am column1, I am column2, "yes, I'm am column3"
The example has three columns. But the following statement reports four columns, because the comma inside the quoted field is also treated as a delimiter:
A = load '/path/to/file' using PigStorage(',');
Any suggestions or links to resources would be appreciated.
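Try loading the data, then use FOREACH ... GENERATE to regenerate the data in whatever format you need. For the fields where you need to remove the quotes, use REPLACE($3, '\"', ''); REPLACE takes the string, a regular expression and the replacement.
data = LOAD 'testdata' USING PigStorage(',');
-- With PigStorage(','), the quoted field is split at its embedded comma,
-- so $3 holds only the part after that comma.
data = FOREACH data GENERATE
(chararray) $0 AS col1:chararray,
(chararray) $1 AS col2:chararray,
(chararray) REPLACE($3, '\"', '') AS col3:chararray;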

Extracting a column from a paragraph from a csv file using java

MAJOR ACC NO,MINOR ACC NO,STD CODE,TEL NO,DIST CODE
7452145,723456, 01,4213036,AAA
7254287,7863265, 01,2121920,AAA
FRUNDTE,FMACNO,FACCNO,FDISTCOD,FBILSEQ,FOOCTYP,FOOCDES,FOOCAMT,FSTD,FTELNO,FNORECON,FXFRACCN,FLANGIND,CUR
12345,71234,7643234,AAA,001,DX,WLR Promotion - Insitu /Pre-Cabled PSTN Connection,-37.87,,,0,,E,EUR
FRUNDTE,FMACNO,FACCNO,FDISTCOD,FBILSEQ,FORDNO,FREF,FCHGDES,FCHGAMT,CUR,FORENFRM,FORENTO
3242241,72349489,2345352,AAA,001,30234843P ,1,NEW CONNECTION - PRECABLED CHARGE,37.87,EUR,2123422,201201234
12123471,7618412389,76333232,AAA,001,3123443P ,2,BROKEN PERIOD RENTAL,5.40,EUR,201234523,20123601
I have a CSV file like the one above, and I want to extract certain columns from it. For example, I want to extract the first column of the first paragraph. I'm fairly new to Java. I am able to read the file, but I don't know how to extract particular columns from the different paragraphs. Any help will be appreciated.
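
One possible approach is to treat every header line as the start of a new paragraph and collect the requested column from the rows that follow it. This is a rough sketch with hypothetical names (ParagraphColumns, isHeader, column). It assumes that a header line is any line whose first field contains a letter, which holds for the sample above but may not hold for other files, and that no field contains a comma in its first columns.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphColumns {

    // Assumption: a header line is one whose first field contains a letter.
    static boolean isHeader(String line) {
        String first = line.split(",", -1)[0];
        return first.chars().anyMatch(Character::isLetter);
    }

    // Returns the values of column `col` (0-based) from the `block`-th
    // paragraph (0-based). Header lines themselves are not included.
    public static List<String> column(List<String> lines, int block, int col) {
        List<String> values = new ArrayList<>();
        int current = -1;                       // index of the paragraph being read
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;                       // skip blank separator lines
            }
            if (isHeader(line)) {
                current++;                      // each header starts a new paragraph
                continue;
            }
            if (current == block) {
                String[] fields = line.split(",", -1);
                if (col < fields.length) {
                    values.add(fields[col].trim());
                }
            }
        }
        return values;
    }
}
```

The lines can be obtained with java.nio.file.Files.readAllLines, and column(lines, 0, 0) then returns the MAJOR ACC NO values of the first paragraph.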
