Apache Sqoop/Pig field escaping

Apache Sqoop/Pig field escaping - java

We are exporting some data from MySQL using Sqoop, doing some processing with it via Apache Pig, and then attempting to export that data from HDFS back into a MySQL database. However, when exporting the data, we are running into issues:
java.io.IOException: Can't export data, please check task tracker logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.NumberFormatException: For input string: ".proseries.com"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:449)
at java.lang.Integer.valueOf(Integer.java:554)
at mdm_urls.__loadFromFields(mdm_urls.java:419)
The HDFS data looks like (tab separated):
id:int url:text tld:text port:int
Somehow the tld field is being imported into the port column for some rows. Out of ~250M rows, this is only the case for less than 10. My initial assumption was that the url field must have a tab in it. However, we have stripped all tabs in our Pig script:
REGISTER target/mystuff.jar;
legacy_urls = LOAD 'url' USING PigStorage(',') AS (id, sha1, url_text);
legacy_urls_norm = FOREACH legacy_urls GENERATE id AS id, sha1 AS sha1, REPLACE(REPLACE(url_text, '\n', ''), '\t', '') AS url_text;
urls = FOREACH legacy_urls_norm GENERATE id, url_text, mystuff.RootDomain(url_text), mystuff.Protocol(url_text), mystuff.Host(url_text), mystuff.Path(url_text), mystuff.EffectiveTld(url_text), mystuff.Port(url_text), sha1;
STORE urls INTO 'mdm_urls';
Here is my sqoop export command:
sqoop export --connect jdbc:mysql://hostnmae/db_name --input-fields-terminated-by "\t" --table test --export-dir my_urls
I am having a difficult time debugging this because the sqoop errors do not give any indication as to what row it was working on (so that I can confirm if a tab char is still present, etc). My first question is, how might I better troubleshoot this issue? My second question is, how are people escaping bad input data with PIG?

Related

Unable to create record in MS-Dynamic CRM from mulesoft

Following is what I am doing.
I am using mule MS-Dynamics connector to create a contact
I get records from mysql Database (Inserted from source file)
Transform it to CRM specific object in dataweave
This works for over 10 Million records. But for a few hundred records
I am getting the following error:
Problem writing SAAJ model to stream: Invalid white space character (0x1f) in text to output (in xml 1.1, could output as a character entity)
With some research I found out that (0x1f) represents US "Unit separator".
I tried replacing this character in my dataweave like this
%var replaceSaaj = (x) -> (x replace /\"0x1f"/ with "" default "")
but the issue persists.
I even tried to look for these characters in my source file and database with no luck.
I am aware that this connector internally uses SOAP services.

How do I create a schema for Hive to parse deeply nested json (Azure Application Insights output) using SerDe?

I'm trying to create a schema for hive to parse json, however, I am having trouble creating the schema when the json doc is in the following structure:
{
"context": {
"custom": {
"dimensions": [{
"action": "GetFilters"
},
{
"userId": "12345678"
}]
}
}
}
I am using the Hadoop emulator for Azure's HDInsights on windows (8.1) and am using java (1.8.0_73). I compiled the SerDe successfully with Maven. I would think that the following would work:
add jar ../lib/json-serde-1.1.9.9-Hive1.2-jar-with-dependencies.jar;
DROP TABLE events;
CREATE EXTERNAL TABLE events (
context STRUCT<custom:STRUCT<dimensions:array<STRUCT<action:string>,STRUCT<userId:string>>>>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/json/event';
When I take out the nested ARRAY>, then the schema parses ok, but with it in, I get the following exception:
MismatchedTokenException(282!=9)
at org.antlr.runtime.BaseRecognizer.recoverFromMismatchedToken(BaseRecog
nizer.java:617)
at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonType(HivePa
rser.java:34909)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonTypeList(Hi
veParser.java:33113)
at org.apache.hadoop.hive.ql.parse.HiveParser.structType(HiveParser.java
:36331)
at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:35334
)
at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:35
054)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonType(HivePa
rser.java:34914)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonTypeList(Hi
veParser.java:33085)
at org.apache.hadoop.hive.ql.parse.HiveParser.structType(HiveParser.java
:36331)
at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:35334
)
at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:35
054)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameType(HiveParser.
java:34754)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameTypeList(HivePar
ser.java:32951)
at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveP
arser.java:4544)
at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.ja
va:2144)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.j
ava:1398)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:
1036)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:19
9)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:16
6)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:409)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:2
68)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793
)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
sorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 2:69 missing > at ',' near 'STRUCT' in column specif
ication
line 2:76 mismatched input '<' expecting : near 'STRUCT' in column specification
hive>

I ended up getting it to work by removing the nested STRUCTs in the ARRAY STRUCT, but I have to access the values with [#]. For example, the following builds the schema:
DROP TABLE events;
CREATE EXTERNAL TABLE events (
context STRUCT<custom:STRUCT<dimensions:ARRAY<STRUCT<action:string,userId:string>>>>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/json/event';
Then I can access items such as the userId like so:
SELECT context.custom.dimensions.userId[1] FROM events;
It works, but is not as readable as I would prefer.

That external table looked good to me. Maybe try downloading another distribution of that JSON serde. I have had success with:
http://www.congiu.net/hive-json-serde/
I have had success in HDInsight 3.2 with http://www.congiu.net/hive-json-serde/1.3/cdh5/ but you might try a newer build for HDP.
Documentation here:
https://github.com/rcongiu/Hive-JSON-Serde

I used an array of maps to process the App Insights custom dimensions with rcongiu's JSON SerDe.
This is using an HDInsight cluster with a linked blob storage account.
add jar wasb://<mycontainer>#<mystorage>.blob.core.windows.net/json-serde-1.3.7-jar-with-dependencies.jar;
DROP TABLE Logs;
CREATE EXTERNAL TABLE Logs (
event array<struct<
count:int,
name:string>
>,
context struct<
custom:struct<
dimensions:array<map<string, string>>
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'wasb://<mycontainer>#<mystorage>.blob.core.windows.net/events/';
SELECT
event[0].name as EventName,
context.custom.dimensions[0]['action'] as Action,
context.custom.dimensions[1]['key'] as Key
FROM Logs
WHERE event[0].name = 'Click';

neo4j upload csv UnmarshalException

I am uploading a 100MB csv file to neo4j containing transactional data. I am getting a java error that I cannot seem to trace to a setting or something that I can change.
neo4j-sh (?)$ CREATE CONSTRAINT ON (a:Account) ASSERT a.id IS UNIQUE;
+-------------------+
| No data returned. |
+-------------------+
Constraints added: 1
48 ms
neo4j-sh (?)$ USING PERIODIC COMMIT
> LOAD CSV FROM
> "file:/somepath/findata.csv"
> AS line
> FIELDTERMINATOR ','
> MERGE (a1:Account { id: toString(line[3]) })
> MERGE (a2:Account { id: toString(line[4]) })
> CREATE (a1)-[:LINK { value: toFloat(line[0]), date: line[5] } ]->(a2);
java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is:
java.io.EOFException
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:228)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:194)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:148)
at com.sun.proxy.$Proxy1.interpretLine(Unknown Source)
at org.neo4j.shell.impl.AbstractClient.evaluate(AbstractClient.java:110)
at org.neo4j.shell.impl.AbstractClient.evaluate(AbstractClient.java:94)
at org.neo4j.shell.impl.AbstractClient.grabPrompt(AbstractClient.java:74)
at org.neo4j.shell.StartClient.grabPromptOrJustExecuteCommand(StartClient.java:357)
at org.neo4j.shell.StartClient.startRemote(StartClient.java:303)
at org.neo4j.shell.StartClient.start(StartClient.java:175)
at org.neo4j.shell.StartClient.main(StartClient.java:120)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:214)
... 11 more
I've tried the command twice and it gives me the same error. Sofar google has not helped me to figure out what I can do to circumvent this error. What is happening in neo4j and how could I solve this?

Maybe this isn't the problem, but your path to your CSV file may be malformed. This would explain the java.rmi.UnmarshalException. The path should be 'file://', where would be something like '/home/cantdutchthis/findata.csv' on a linux system. On a linux or Mac machine, this means that there will be 3 '/'s - 'file:///home/cantdutchthis/findata.csv'.
Grace and peace,
Jim

When I sort data using hadoop, I get the error

I want to use hadoop to compute the distance bewteen points and sort their result order by key, when run hadoop jar knn.jar input output, I get the errors as follows:
13/11/28 15:35:38 INFO mapred.JobClient: Task Id : attempt_2013
10221205_0036_m_000000_0, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected or
g.apache.hadoop.io.Text, recieved org.apache.hadoop.io.DoubleWr
itable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.col
lect(MapTask.java:1019)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.
write(MapTask.java:690)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.w
rite(TaskInputOutputContext.java:80)
at xautjzd.knn.hadoop.apache.KNN$KNNMapper.map(KNN.java
:35)
at xautjzd.knn.hadoop.apache.KNN$KNNMapper.map(KNN.java
:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:1
45)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTas
k.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:36
4)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native M
ethod)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs
(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
My Code:

It looks like you have set conf.setOutputKeyClass(Text.class) where as it should be conf.setOutputKeyClass(DoubleWrtable.class)

MYSQL special chars issue

I´ve been having this problem for a long time, I´ve searched the internet many times for the solution, tried lots of them but not found an adequate solution.
I really don´t know what to do so if you could please help me I´d be very thankful.
(Sorry for my poor english).
Question: How can I solve the charset incompatibility between the input archive and a MYSql table?
Problem: When importing the archive from on my computer the information appears in my database, but some chars as ('ã', 'ç', 'á', etc..) are shown as ?.
Aditional information
I'm using MYSql, my version and variable status are:
MySQL VERSION : 5.5.10
HOST : localhost
USER : root
PORT : 3306
SERVER DEFAULT CHARSET : utf8
character_set_client : utf8
character_set_connection : utf8
character_set_database : utf8
character_set_filesystem : BINARY
character_set_results : utf8
character_set_server : utf8
character_set_system : utf8
collation_connection : utf8_general_ci
collation_database : utf8_general_ci
collation_server : utf8_general_ci
completion_type : NO_CHAIN
concurrent_insert : AUTO
The query that´s being used is:
LOAD DATA LOCAL INFILE 'xxxxx/file.txt'
INTO TABLE xxxxTable
FIELDS TERMINATED BY ';'
LINES TERMINATED BY ' '
IGNORE 1 LINES
( status_ordenar,numero,newstatus,rede,data_emissao,inicio,termino,tempo_indisp
, cli_afet,qtd_cli_afet,cod_encerr,uf_ofensor,localidades,clientes_afetados
, especificacao,equipamentos,area_ofens,descricao_encerr,criticidade,cod_erro
, observacao,id_falha_perc,id_falha_conf,nba,solucao,falhapercebida,falhaconfirmada
, resp_i,resp_f,resp_ue,pre_handover,falha_identificada,report_netcool,tipo_falha
, num_notificacao,equip_afetados,descricao)
About the file being imported:
I´ve opened the file with open office whith 3 charsets:
UTF8 - Gave me strange chars in place of the 'ç', 'ã', etc...
ISO-8859-1 - OK.
WIN-1252 - OK.
ASCII/US - OK.
Already tested: I´ve tested some charsets in my database: latin1, utf-8, ascii, but all of them gave me the same result (? instead of 'á', 'ç' etc).
Extra: I'm using Java with Java JDBC to generate and send the query.

file.txt is saved in ISO-8859-1 or Windows-1252 (these two are very similar), and being interpreted as UTF-8 by MySQL. These are incompatible.
How can I tell?
See point 3.: the file displays correctly when interpreted as ISO-8859-1 or Windows-1252.
See point 1.: character_set_database : utf8
Solution: either convert the file to UTF-8, or tell MySQL to interpret it as ISO-8859-1 or Windows-1252.
Background: the characters you provide (ã etc.) are single-byte values in windows-1252, and these bytes are illegal values in UTF-8, thus yielding the '?'s (unicode replacement characters).
Snippet from MySQL docs:
LOAD DATA INFILE Syntax
The character set indicated by the character_set_database system variable is used to interpret the information in the file.

Saved your characters with standard Windows Notepad as UTF-8 file (Notepad++ is also OK).
Exact file content:
'ã', 'ç', 'á'
MySQL version: 5.5.22
Database charset: utf8
Database collation: utf8_general_ci
CREATE TABLE `abc` (
`qwe` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Imported data with command
LOAD DATA LOCAL INFILE 'C:/test/utf8.txt'
INTO TABLE abc
FIELDS TERMINATED BY ';'
LINES TERMINATED BY ' '
IGNORE 1 LINES
( qwe)
Result (displayed in SQLyog):
So, first - you should check original file with reliable editor (notepad, notepad++). If file corrupted, then you should take another file.
Second - if file is OK, add you Java code for sending data to MySql into question.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Apache Sqoop/Pig field escaping - java

Related

Unable to create record in MS-Dynamic CRM from mulesoft

How do I create a schema for Hive to parse deeply nested json (Azure Application Insights output) using SerDe?

neo4j upload csv UnmarshalException

When I sort data using hadoop, I get the error

MYSQL special chars issue

Categories

Resources