Issue with Hive streaming and Azure Data Lake Store - java

I am writing a Play 2 Java web application that ingests data into HDInsight Interactive Query using the Hive Streaming API (https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest). The Hive data is stored on Azure Data Lake Store.
My code is loosely based on https://github.com/mradamlacey/hive-streaming-azure-hdinsight/blob/master/src/main/java/com/cbre/eim/HiveStreamingExample.java.
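For context, a simplified sketch of the Hive Streaming call pattern involved (the same classes appear in the stack trace below); the metastore URI, database, table and partition value are placeholders, not my actual values:

import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.StrictJsonWriter;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

import java.util.Collections;

public class HiveStreamingSketch {
    public static void main(String[] args) throws Exception {
        // Endpoint for one partition of the target table (placeholder values).
        HiveEndPoint endPoint = new HiveEndPoint(
                "thrift://<metastore-host>:9083", "default", "data",
                Collections.singletonList("2018-05-07"));

        StreamingConnection connection = endPoint.newConnection(true);
        StrictJsonWriter writer = new StrictJsonWriter(endPoint);

        // fetchTransactionBatch() is where the RecordUpdater (and hence the adl:// path) gets created.
        TransactionBatch batch = connection.fetchTransactionBatch(10, writer);
        batch.beginNextTransaction();
        batch.write("{\"id\": 1, \"value\": \"example\"}".getBytes("UTF-8"));
        batch.commit();
        batch.close();
        connection.close();
    }
}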
When I run the code on one of my headnodes I receive the following error:
play.api.UnexpectedException: Unexpected exception[StreamingIOFailure: Failed creating RecordUpdaterS for adl://home/hive/warehouse/data/ingest_date=2018-05-07 txnIds[486,495]]
at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:251)
at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:182)
at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:343)
at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:341)
at scala.concurrent.Future.$anonfun$recoverWith$1(Future.scala:414)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:37)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
Caused by: org.apache.hive.hcatalog.streaming.StreamingIOFailure: Failed creating RecordUpdaterS for adl://home/hive/warehouse/data/ingest_date=2018-05-07 txnIds[486,495]
at org.apache.hive.hcatalog.streaming.AbstractRecordWriter.newBatch(AbstractRecordWriter.java:166)
at org.apache.hive.hcatalog.streaming.StrictJsonWriter.newBatch(StrictJsonWriter.java:41)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$TransactionBatchImpl.<init>(HiveEndPoint.java:559)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$TransactionBatchImpl.<init>(HiveEndPoint.java:512)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$ConnectionImpl.fetchTransactionBatchImpl(HiveEndPoint.java:397)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$ConnectionImpl.fetchTransactionBatch(HiveEndPoint.java:377)
at hive.HiveRepository.createMany(HiveRepository.java:76)
at controllers.HiveController.create(HiveController.java:40)
at router.Routes$$anonfun$routes$1.$anonfun$applyOrElse$2(Routes.scala:70)
at play.core.routing.HandlerInvokerFactory$$anon$4.resultCall(HandlerInvoker.scala:137)
Caused by: java.io.IOException: No FileSystem for scheme: adl
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.hive.ql.io.orc.OrcRecordUpdater.<init>(OrcRecordUpdater.java:233)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getRecordUpdater(OrcOutputFormat.java:292)
at org.apache.hive.hcatalog.streaming.AbstractRecordWriter.createRecordUpdater(AbstractRecordWriter.java:226)
I raised the question on the Microsoft forum and on the Hive JIRA as well.
I can confirm that the JARs described here are present on the classpath:
com.microsoft.azure.azure-data-lake-store-sdk-2.2.5.jar
org.apache.hadoop.hadoop-azure-datalake-3.1.0.jar

No FileSystem for scheme
You get this error when the filesystem for the adl:// scheme is not configured. This probably needs to be done in the core-site.xml files of both the HiveServer and your local client.
Just because the JARs exist doesn't mean they are loaded onto the classpath and configured to read from your Azure account.
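As a quick programmatic check (the real fix belongs in core-site.xml on both sides), something like the following should resolve the scheme once hadoop-azure-datalake and the ADLS SDK are actually loaded; the property names are the documented ones, the account and credential values are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdlSchemeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map the adl:// scheme to the connector classes in hadoop-azure-datalake.
        conf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem");
        conf.set("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl");

        // Service-principal credentials for the store (placeholder values).
        conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential");
        conf.set("fs.adl.oauth2.client.id", "<application-id>");
        conf.set("fs.adl.oauth2.credential", "<application-secret>");
        conf.set("fs.adl.oauth2.refresh.url",
                "https://login.microsoftonline.com/<tenant-id>/oauth2/token");

        // If the scheme is wired up correctly this prints AdlFileSystem
        // instead of throwing "No FileSystem for scheme: adl".
        FileSystem fs = new Path("adl://<account>.azuredatalakestore.net/").getFileSystem(conf);
        System.out.println(fs.getClass().getName());
    }
}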

Related

Flink Oracle JDBC sink connector not loading the driver

I am trying to create a Flink JDBC sink to an Oracle database. When run locally (from a JUnit test and the MiniCluster) it works, but when deployed in Kubernetes it throws an exception saying it cannot find a suitable driver. The classpath is:
Classpath: /flink/lib/flink-cep-scala_2.12-1.13.5-stream1.jar:/flink/lib/flink-connector-jdbc_2.12-1.13.5.jar:/flink/lib/flink-csv-1.13.5-stream1.jar:/flink/lib/flink-json-1.13.5-stream1.jar:/flink/lib/flink-queryable-state-runtime_2.12-1.13.5-stream1.jar:/flink/lib/flink-shaded-netty-tcnative-dynamic-2.0.30.Final-13.0-stream1.jar:/flink/lib/flink-shaded-zookeeper-3.4.14.jar:/flink/lib/flink-table-blink_2.12-1.13.5-stream1.jar:/flink/lib/flink-table_2.12-1.13.5-stream1.jar:/flink/lib/log4j-1.2-api-2.16.0.jar:/flink/lib/log4j-api-2.16.0.jar:/flink/lib/log4j-core-2.16.0.jar:/flink/lib/log4j-slf4j-impl-2.16.0.jar:/flink/lib/ojdbc8-21.5.0.0.jar:/flink/lib/vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:/flink/lib/flink-dist_2.12-1.13.5-stream1.jar:::
I tried multiple things:
Included the driver in the flink/lib directory, packaged the flink-connector-jdbc connector within the job jar, and used .withDriverName("oracle.jdbc.OracleDriver") / .withDriverName("oracle.jdbc.driver.OracleDriver")
Included both the driver and the connector in the flink/lib directory and used .withDriverName("oracle.jdbc.OracleDriver") / .withDriverName("oracle.jdbc.driver.OracleDriver")
I also tried changing the classloading configuration to classloader.parent-first-patterns.additional: oracle.jdbc.
But nothing seems to work for me. The exception is:
failure cause: java.io.IOException: unable to open JDBC writer
at org.apache.flink.connector.jdbc.internal.AbstractJdbcOutputFormat.open(AbstractJdbcOutputFormat.java:56)
at org.apache.flink.connector.jdbc.internal.JdbcBatchingOutputFormat.open(JdbcBatchingOutputFormat.java:115)
at org.apache.flink.connector.jdbc.internal.GenericJdbcSinkFunction.open(GenericJdbcSinkFunction.java:49)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:34)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
at org.apache.flink.streaming.api.operators.StreamSink.open(StreamSink.java:46)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:442)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:585)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call(StreamTaskActionExecutor.java:100)
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:565)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:540)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:759)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLException: No suitable driver found for "jdbc:oracle:thin:#//SOMECONNECTION"
at org.apache.flink.connector.jdbc.internal.connection.SimpleJdbcConnectionProvider.getOrEstablishConnection(SimpleJdbcConnectionProvider.java:126)
at org.apache.flink.connector.jdbc.internal.AbstractJdbcOutputFormat.open(AbstractJdbcOutputFormat.java:54)
... 14 more
What am I missing?
There is no support for Oracle via JDBC in Flink 1.13; that was only added in Flink 1.15.
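Once you are on Flink 1.15+ (with the ojdbc jar available), a JDBC sink to Oracle looks roughly like this minimal sketch; the table, connection URL and credentials below are placeholders:

import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OracleSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> names = env.fromElements("alice", "bob");

        names.addSink(JdbcSink.sink(
                "INSERT INTO users (name) VALUES (?)",                 // placeholder table
                (stmt, name) -> stmt.setString(1, name),               // bind each record
                JdbcExecutionOptions.builder().withBatchSize(100).build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")  // placeholder URL
                        .withDriverName("oracle.jdbc.OracleDriver")
                        .withUsername("app_user")
                        .withPassword("app_password")
                        .build()));

        env.execute("oracle-jdbc-sink");
    }
}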

Connect to a Delta Lake table for CRUD operations on an Azure ADLS Gen2 path using only pure Java code

I am trying to connect using the DSR method, but I am getting the following error while reading the snapshot Parquet files on an Azure ADLS Gen2 path.
I have added some Maven dependencies, e.g. hadoop-client, hadoop-azure, parquet-hadoop, scala-library, spark-core_2.12.
Error:
Exception in thread "main" java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction0$mcJ$sp
at io.delta.standalone.internal.SnapshotManagement.getLogSegmentForVersion(SnapshotManagement.scala:102)
at io.delta.standalone.internal.SnapshotManagement.getLogSegmentForVersion$(SnapshotManagement.scala:96)
at io.delta.standalone.internal.DeltaLogImpl.getLogSegmentForVersion(DeltaLogImpl.scala:32)
at io.delta.standalone.internal.SnapshotManagement.getSnapshotAtInit(SnapshotManagement.scala:201)
at io.delta.standalone.internal.SnapshotManagement.$init$(SnapshotManagement.scala:35)
at io.delta.standalone.internal.DeltaLogImpl.<init>(DeltaLogImpl.scala:36)
at io.delta.standalone.internal.DeltaLogImpl$.apply(DeltaLogImpl.scala:83)
at io.delta.standalone.internal.DeltaLogImpl$.forTable(DeltaLogImpl.scala:72)
at io.delta.standalone.internal.DeltaLogImpl.forTable(DeltaLogImpl.scala)
at io.delta.standalone.DeltaLog.forTable(DeltaLog.java:86)
at com.example.demo.DeltaLakeApplication.main(DeltaLakeApplication.java:38)
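For context, the call pattern in the stack trace corresponds roughly to this Delta Standalone usage (a sketch, not the actual application code); the ADLS Gen2 path is a placeholder, the fs.azure.* credentials are omitted, and it assumes a scala-library version matching the one delta-standalone was built against (e.g. 2.12 for delta-standalone_2.12):

import io.delta.standalone.DeltaLog;
import io.delta.standalone.Snapshot;
import org.apache.hadoop.conf.Configuration;

public class DeltaStandaloneRead {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // ABFS credentials (fs.azure.account.* properties) would normally be set here.

        // Open the Delta log for a table stored on ADLS Gen2 (placeholder path).
        DeltaLog log = DeltaLog.forTable(conf,
                "abfss://gen2container@devbdstreamsv2.dfs.core.windows.net/delta/my_table");

        // Read the latest snapshot and list the data files it references.
        Snapshot snapshot = log.snapshot();
        snapshot.getAllFiles().forEach(addFile -> System.out.println(addFile.getPath()));
    }
}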

istio error: Details: java.io.IOException: Unknown apiVersionKind

I receive this error message in Jenkins when trying to deploy my application to Kubernetes. Is there something I am missing?
istio-gateway.yml ERROR: ERROR: java.io.IOException: ERROR: YAML file
istio-gateway.yml is invalid, please check it. Details:
java.io.IOException: Unknown apiVersionKind:
networking.istio.io/v1alpha3/Gateway known kinds are...
I am using Spring Boot with Java.
You are trying to create an Istio Gateway using that YAML. You probably don't have Istio installed in your Kubernetes cluster. If it's not installed, then you need to install it.

java.io.IOException: No FileSystem for scheme: abfs for ADLS Gen2 in Spark (Java)

I am trying to access ADLS Gen2 in Spark (Java) with the following configuration properties:
fs.azure.account.auth.type
fs.azure.account.oauth.provider.type
fs.azure.account.oauth2.client.endpoint
fs.azure.account.oauth2.client.id
fs.azure.account.oauth2.client.secret
I have created the blob container and uploaded a file, e.g. https://devbdstreamsv2.dfs.core.windows.net/gen2container/adlsgen2/flat.json, using Azure Storage Explorer version 1.9. I am trying to access the abfs file path in the format mentioned in the documentation: abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/
But my doubt is that we are not initialising the abfs filesystem anywhere in the runner code, so I am getting the exception "No FileSystem for scheme: abfs". How can I resolve this issue? I want to know how to initialise the abfs filesystem for ADLS Gen2 using Spark with Java.
You need a distribution of Spark whose hadoop-azure JAR includes the abfs connector. The hadoop-2.7.x JARs in the normal ASF releases do not, as abfs came out later (Hadoop 3.2+).
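With such a build on the classpath, a minimal sketch of wiring up the OAuth properties from the question and reading the file (tenant ID, client ID and secret are placeholders for your own service principal):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AbfsReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("abfs-read").getOrCreate();

        // OAuth (service principal) settings for the ABFS connector; placeholder values.
        org.apache.hadoop.conf.Configuration hc = spark.sparkContext().hadoopConfiguration();
        hc.set("fs.azure.account.auth.type", "OAuth");
        hc.set("fs.azure.account.oauth.provider.type",
                "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider");
        hc.set("fs.azure.account.oauth2.client.endpoint",
                "https://login.microsoftonline.com/<tenant-id>/oauth2/token");
        hc.set("fs.azure.account.oauth2.client.id", "<client-id>");
        hc.set("fs.azure.account.oauth2.client.secret", "<client-secret>");

        // The abfs filesystem is "initialised" simply by using an abfss:// path;
        // the scheme resolves only if the hadoop-azure ABFS connector is on the classpath.
        Dataset<Row> df = spark.read().json(
                "abfss://gen2container@devbdstreamsv2.dfs.core.windows.net/adlsgen2/flat.json");
        df.show();
    }
}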

Azure specific reading files from local on spark

I am struggling with Azure wasb on Spark.
I am reading a .json.gz file from disk and loading it into HDFS. I have used the following code extensively on other systems.
val file_a_raw = sqlContext.read.json("/home/users/repo_test/file_a.json.gz")
However, on Azure, this returns:
java.io.FileNotFoundException: Filewasb://server-2017-03-07t08-13-41-314z#server.blob.core.windows.net/home/users/repo_test/file_a.json.gz does not exist.
I have checked this location and the file is there and correct.
I think there should be a : between .net and the file path, but I get a Java error when trying to manually add that in.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme name at index 0:
I've also tried:
Filewasb:///home/users/repo_test/file_a.json.gz
But that returns:
java.io.IOException: No FileSystem for scheme: Filewasb
This code works fine on non-Azure Spark.
For Azure, you'll need to configure Spark with the proper credentials. Databricks has documentation on this: https://docs.databricks.com/user-guide/faq/azure-blob-storage.html
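A minimal sketch of what that configuration looks like when reading directly over wasb:// (assuming the hadoop-azure and azure-storage JARs are available; the container, storage account and key below are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WasbReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("wasb-read").getOrCreate();

        // Storage account key for the wasb:// scheme (placeholder account and key).
        spark.sparkContext().hadoopConfiguration().set(
                "fs.azure.account.key.<storage-account>.blob.core.windows.net",
                "<storage-account-key>");

        // Fully qualified wasb URI: wasb://<container>@<account>.blob.core.windows.net/<path>
        Dataset<Row> fileARaw = spark.read().json(
                "wasb://<container>@<storage-account>.blob.core.windows.net/home/users/repo_test/file_a.json.gz");
        fileARaw.show();
    }
}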
