Azure specific reading files from local on spark

I am struggling with Azure wasb on Spark.
I am reading a .json.gz file from local disk and loading it into HDFS. I have used the following code extensively on other systems:
val file_a_raw = sqlContext.read.json("/home/users/repo_test/file_a.json.gz")
However, on Azure, this returns:
java.io.FileNotFoundException: Filewasb://server-2017-03-07t08-13-41-314z@server.blob.core.windows.net/home/users/repo_test/file_a.json.gz does not exist.
I have checked this location and the file is there and correct.
I think there should be a : between .net and the file path, but I get a Java error when I try to add one manually:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme name at index 0:
I've also tried:
Filewasb:///home/users/repo_test/file_a.json.gz
But that returns:
java.io.IOException: No FileSystem for scheme: Filewasb
This code works fine on non-Azure Spark.

For Azure, you'll need to configure Spark with the proper credentials. Databricks has documentation on this: https://docs.databricks.com/user-guide/faq/azure-blob-storage.html
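Separately from credentials, the error message itself hints at what may be going on: the scheme-less path appears to have been qualified against the cluster's default filesystem (wasb), so Spark looked for the file in blob storage rather than on local disk. The sketch below imitates that qualification step with plain java.net.URI. It is a simplified illustration, not Hadoop's actual code, and the container and account names are hypothetical.

```java
import java.net.URI;

// Simplified illustration only; this is not Hadoop's real path-qualification logic.
final class PathSchemeDemo {

    // Qualifies a path against a default filesystem URI when the path has no scheme.
    static URI qualify(String path, URI defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri; // an explicit scheme (file:, wasb:, hdfs:, ...) is kept as-is
        }
        // No scheme: scheme and authority are taken from the default filesystem.
        return defaultFs.resolve(uri);
    }

    public static void main(String[] args) {
        // Hypothetical container and storage account names.
        URI defaultFs = URI.create("wasb://mycontainer@myaccount.blob.core.windows.net/");

        // Scheme-less path ends up pointing into blob storage.
        System.out.println(qualify("/home/users/repo_test/file_a.json.gz", defaultFs));

        // Explicit file: scheme keeps the path on the local filesystem.
        System.out.println(qualify("file:///home/users/repo_test/file_a.json.gz", defaultFs));
    }
}
```

If that is the cause, passing "file:///home/users/repo_test/file_a.json.gz" to sqlContext.read.json may be worth trying. Note that a file: path generally has to be readable on every node that runs a task, not only on the machine where the job is submitted.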

Related

Java sftpChannel put error (The system cannot find the path specified)

I am transferring files to a remote SFTP server with the Java sftpChannel. Everything works as expected when tested in STS (Spring Tool Suite 4.7.1), but it fails when deployed to a Tomcat server.
// Logs
File path: S:/System/AutoSend/Data.json
Remote path: Data.json
Before sftp put
Sftp error: 4: java.io.FileNotFoundException: S:\System\AutoSend\Data.json (The system cannot find the path specified)
(Has the Unix-formatted path been converted to Windows format automatically?)
What can I do to fix the issue? Thanks a lot.

Have you tried making the file path a String? Like this: "S:/System/AutoSend/Data.json"
Also, is the "S" drive available on your Tomcat server? If not, try using the IP address instead.
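To follow up on the drive question: on Windows, mapped drive letters usually belong to a user's logon session, so a Tomcat service running under another account may simply not see S:. Whether S: is a mapped drive here is an assumption. A minimal sketch for checking what the server process can actually see before calling put (the class and method names are made up for illustration):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class LocalFileCheck {

    // Returns "ok" if the file can be used as an upload source, otherwise a short reason.
    static String describe(String localPath) {
        Path p = Paths.get(localPath).toAbsolutePath();
        if (!Files.exists(p)) {
            return "missing: " + p;
        }
        if (!Files.isRegularFile(p)) {
            return "not a regular file: " + p;
        }
        if (!Files.isReadable(p)) {
            return "not readable: " + p;
        }
        return "ok";
    }

    public static void main(String[] args) {
        // Path taken from the log in the question.
        System.out.println(describe("S:/System/AutoSend/Data.json"));
        // The account matters: a service account may not have the same drives mapped.
        System.out.println("running as user: " + System.getProperty("user.name"));
    }
}
```

Logging the result of describe(...) from inside the web application, rather than from the IDE, shows what the Tomcat process itself sees.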

Error while trying to write on parquet file in datastage 11.7 (File_Connector_20,0: java.lang.NoClassDefFoundError: org.apache.hadoop.fs.FileSystem)

We recently upgraded DataStage from 9.1 to 11.7 on an AIX 7.1 server.
I am trying to use the new File Connector to write to a Parquet file. I created a simple job that reads from Teradata as the source and writes to a Parquet file as the target,
but I am facing the error below:
> File_Connector_20,0: java.lang.NoClassDefFoundError: org.apache.hadoop.fs.FileSystem
at java.lang.J9VMInternals.prepareClassImpl (J9VMInternals.java)
at java.lang.J9VMInternals.prepare (J9VMInternals.java: 304)
at java.lang.Class.getConstructor (Class.java: 594)
at com.ibm.iis.jis.utilities.dochandler.impl.OutputBuilder.<init> (OutputBuilder.java: 80)
at com.ibm.iis.jis.utilities.dochandler.impl.Registrar.getBuilder (Registrar.java: 340)
at com.ibm.iis.jis.utilities.dochandler.impl.Registrar.getBuilder (Registrar.java: 302)
at com.ibm.iis.cc.filesystem.FileSystem.getBuilder (FileSystem.java: 2586)
at com.ibm.iis.cc.filesystem.FileSystem.writeFile (FileSystem.java: 1063)
at com.ibm.iis.cc.filesystem.FileSystem.process (FileSystem.java: 935)
at com.ibm.is.cc.javastage.connector.CC_JavaAdapter.run (CC_JavaAdapter.java: 444)
I followed the steps in the link below:
https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.7.0/com.ibm.swg.im.iis.conn.s3.usage.doc/topics/amaze_file_formats.html
1- I uploaded the jar files into "/ds9/IBM/InformationServer/Server/DSComponents/jars".
2- I added them to CLASSPATH in agent.sh, then restarted DataStage.
3- I set the environment variable CC_USE_LATEST_FILECC_JARS to the value parquet-1.9.0.jar:orc-2.1.jar.
I also tried adding CLASSPATH as an environment variable in the job, but that did not work.
Note that I am using Local for File System.
Any hint is appreciated, as I have been searching for a long time.
Thanks in advance,

Which File System mode are you using? If you are using Native HDFS as the File System mode, then you need to configure CLASSPATH to include some third-party jars.
These links should provide some guidance:
https://www.ibm.com/support/pages/node/301847
https://www.ibm.com/support/pages/steps-required-configure-file-connector-use-parquet-or-orc-file-format
Note: depending on the Hadoop distribution and version you are using, the versions of the jars could be different.
If the above information does not help in resolving the issue, you may have to reach out to IBM Support to get this addressed.

To use the File Connector, there is no need to add CLASSPATH in agent.sh unless you want to import HDFS files from IMAM.
If your requirement is reading Parquet files, then set
$CC_USE_LATEST_FILECC_JARS=parquet-1.9.0.jar
$FILECC_PARQUET_AVRO_COMPAT_MODE=TRUE
If you are still seeing the issue, run the job with $CC_MSG_LEVEL=2 and open an IBM support case, attaching the job design, the FULL job log and the Version.xml file from the Engine tier.
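As a generic aid for this kind of NoClassDefFoundError, the sketch below checks whether a class can be found by the JVM it runs in and prints that JVM's classpath. It only reflects the process it is started from, so it would have to run with the same CLASSPATH settings as the connector to say anything about the job. The class name ClassPathProbe is made up for illustration.

```java
final class ClassPathProbe {

    // Returns true if the class can be located by this class's loader (without initializing it).
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClassPathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError raised while resolving dependencies.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        // The class named in the stack trace from the question.
        System.out.println("org.apache.hadoop.fs.FileSystem loadable: "
                + isLoadable("org.apache.hadoop.fs.FileSystem"));
    }
}
```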

java.io.IOException: No FileSystem for scheme: abfs for adls-gen 2 in spark java

I am trying to access ADLS Gen2 from Spark in Java with the following configuration properties:
fs.azure.account.auth.type
fs.azure.account.oauth.provider.type
fs.azure.account.oauth2.client.endpoint
fs.azure.account.oauth2.client.id
fs.azure.account.oauth2.client.secret
I have created the blob container and uploaded a file to it (e.g. https://devbdstreamsv2.dfs.core.windows.net/gen2container/adlsgen2/flat.json) using Azure Storage Explorer version 1.9. I am trying to access it through an abfs path in the form given in the documentation: abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/
My doubt is that we are not initialising the abfs filesystem anywhere in the runner code, so I am getting the exception "No FileSystem for scheme: abfs". How can I resolve this issue? I want to know how to initialise the abfs filesystem for ADLS Gen2 from Spark in Java.

You need a distribution of Spark which has the abfs connector in the hadoop-azure JAR. The hadoop-2.7.x JARs in the normal ASF releases do not, as abfs came out later (Hadoop 3.2+).
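Once a Hadoop build with the abfs connector is on the classpath, the path passed to Spark still has to be in the abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path> form rather than the https URL shown in Storage Explorer. A small sketch of that conversion, using only java.net.URI (the helper name toAbfss is made up, and it assumes the first path segment of the https URL is the file system, i.e. the container):

```java
import java.net.URI;

final class AbfsUriBuilder {

    // Converts https://<account>.dfs.core.windows.net/<file_system>/<path>
    // into    abfss://<file_system>@<account>.dfs.core.windows.net/<path>
    static String toAbfss(String httpsUrl) {
        URI u = URI.create(httpsUrl);
        String host = u.getHost();
        String path = u.getPath();
        if (host == null || path == null || path.length() < 2) {
            throw new IllegalArgumentException("expected https://<account>/<file_system>/...: " + httpsUrl);
        }
        int slash = path.indexOf('/', 1); // end of the first path segment
        String fileSystem = slash < 0 ? path.substring(1) : path.substring(1, slash);
        String rest = slash < 0 ? "/" : path.substring(slash);
        return "abfss://" + fileSystem + "@" + host + rest;
    }

    public static void main(String[] args) {
        // URL taken from the question.
        System.out.println(toAbfss(
                "https://devbdstreamsv2.dfs.core.windows.net/gen2container/adlsgen2/flat.json"));
        // prints abfss://gen2container@devbdstreamsv2.dfs.core.windows.net/adlsgen2/flat.json
    }
}
```

This only builds the string; it does not make the scheme resolvable. Without the connector classes the same "No FileSystem for scheme" error will still appear.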

Issue with Hive streaming and Azure Data Lake Store

I am writing a Play2 Java web application to ingest data into HDInsight Interactive Query using the Hive Streaming API (https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest). The Hive data is stored on Azure Data Lake Store.
I loosely based my code on https://github.com/mradamlacey/hive-streaming-azure-hdinsight/blob/master/src/main/java/com/cbre/eim/HiveStreamingExample.java.
When I run the code on one of my head nodes, I receive the following error:
play.api.UnexpectedException: Unexpected exception[StreamingIOFailure: Failed creating RecordUpdaterS for adl://home/hive/warehouse/data/ingest_date=2018-05-07 txnIds[486,495]]
at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:251)
at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:182)
at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:343)
at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:341)
at scala.concurrent.Future.$anonfun$recoverWith$1(Future.scala:414)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:37)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
Caused by: org.apache.hive.hcatalog.streaming.StreamingIOFailure: Failed creating RecordUpdaterS for adl://home/hive/warehouse/data/ingest_date=2018-05-07 txnIds[486,495]
at org.apache.hive.hcatalog.streaming.AbstractRecordWriter.newBatch(AbstractRecordWriter.java:166)
at org.apache.hive.hcatalog.streaming.StrictJsonWriter.newBatch(StrictJsonWriter.java:41)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$TransactionBatchImpl.<init>(HiveEndPoint.java:559)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$TransactionBatchImpl.<init>(HiveEndPoint.java:512)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$ConnectionImpl.fetchTransactionBatchImpl(HiveEndPoint.java:397)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$ConnectionImpl.fetchTransactionBatch(HiveEndPoint.java:377)
at hive.HiveRepository.createMany(HiveRepository.java:76)
at controllers.HiveController.create(HiveController.java:40)
at router.Routes$$anonfun$routes$1.$anonfun$applyOrElse$2(Routes.scala:70)
at play.core.routing.HandlerInvokerFactory$$anon$4.resultCall(HandlerInvoker.scala:137)
Caused by: java.io.IOException: No FileSystem for scheme: adl
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.hive.ql.io.orc.OrcRecordUpdater.<init>(OrcRecordUpdater.java:233)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getRecordUpdater(OrcOutputFormat.java:292)
at org.apache.hive.hcatalog.streaming.AbstractRecordWriter.createRecordUpdater(AbstractRecordWriter.java:226)
I have also raised the question on the Microsoft forum and on the Hive JIRA.
I can confirm that the following jars are present in the classpath:
com.microsoft.azure.azure-data-lake-store-sdk-2.2.5.jar
org.apache.hadoop.hadoop-azure-datalake-3.1.0.jar

No FileSystem for scheme
You get this error when the filesystem is not configured, which probably needs to be done in both the HiveServer's and your local client's core-site.xml files.
Just because the JARs exist doesn't mean they are loaded onto the classpath and configured to read from your Azure account.
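One way to test the "present but not loaded" theory is to ask the running JVM where it actually loaded a class from. The sketch below does that for org.apache.hadoop.fs.adl.AdlFileSystem, the implementation class that the Hadoop ADLS documentation associates with the fs.adl.impl property. The class name JarLocator is made up, and the check is only meaningful when run inside the same process (here, the Play application) that throws the error.

```java
import java.security.CodeSource;

final class JarLocator {

    // Reports which JAR or directory a class was loaded from, or why that is unknown.
    static String locate(String className) {
        try {
            Class<?> c = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "loaded, location unknown (for example a JDK class)";
            }
            return "loaded from " + src.getLocation();
        } catch (ClassNotFoundException | LinkageError e) {
            return "not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(locate("org.apache.hadoop.fs.adl.AdlFileSystem"));
    }
}
```

If the class turns out to be loadable, the remaining suspect is configuration, as described above: the adl scheme has to be mapped to that class in the configuration the client actually reads.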

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

I've installed the Hadoop file system on my machine and I'm using a Maven dependency (spark-mllib_2.10) to provide the Spark environment for my code.
My code uses Spark MLlib and accesses data from the Hadoop file system with this code:
String finalData = ProjectProperties.hadoopBasePath + ProjectProperties.finalDataPath;
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), finalData).toJavaRDD();
With following properties set.
finalDataPath = /data/finalInput.txt
hadoopBasePath = hdfs://127.0.0.1:54310
I am starting the DFS nodes externally with the command
start-dfs.sh
My code works perfectly fine when run from Eclipse, but if I export the whole project as an executable jar, it gives me the following exception:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
I also checked different solutions given online for this issue, where people suggest adding the following:
hadoopConfig.set("fs.hdfs.impl",
org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()
);
hadoopConfig.set("fs.file.impl",
org.apache.hadoop.fs.LocalFileSystem.class.getName()
);
OR
<property>
<name>fs.file.impl</name>
<value>org.apache.hadoop.fs.LocalFileSystem</value>
<description>The FileSystem for file: uris.</description>
</property>
<property>
<name>fs.hdfs.impl</name>
<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
<description>The FileSystem for hdfs: uris.</description>
</property>
But I don't use any Hadoop context or Hadoop config in my project. I simply load the data from Hadoop using the URL.
Can someone give an answer relevant to this issue?
Please note that this works fine from Eclipse, and only fails when I export the same project as an executable jar.
Update
As suggested in the comments and in the solutions found online, I tried two things:
Added dependencies to my pom.xml for the hadoop-core, hadoop-hdfs and hadoop-client libraries.
Added the above properties configuration to Hadoop's core-site.xml, as suggested here: http://grokbase.com/t/cloudera/scm-users/1288xszz7r/no-filesystem-for-scheme-hdfs
But still no luck in getting the error resolved. I get the same issue locally on my machine as well as on one of the remote machines I tried it on.
I also installed Hadoop the same way I did on my machine, using the link mentioned above.
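One commonly reported cause of "works in the IDE, fails from an executable jar" is the service file META-INF/services/org.apache.hadoop.fs.FileSystem. Several Hadoop jars ship their own copy of it, and a single merged jar may keep only one copy, which can drop the entry for org.apache.hadoop.hdfs.DistributedFileSystem. Whether that applies here is an assumption. The sketch below lists every entry of that file visible on the classpath, so the output from Eclipse and from the exported jar can be compared (the class name ServiceFileLister is made up for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ServiceFileLister {

    static final String HADOOP_FS_SERVICES = "META-INF/services/org.apache.hadoop.fs.FileSystem";

    // Collects the non-empty, non-comment lines of every copy of the resource on the classpath.
    static List<String> entries(String resourceName) {
        List<String> out = new ArrayList<>();
        try {
            ClassLoader loader = ServiceFileLister.class.getClassLoader();
            for (URL url : Collections.list(loader.getResources(resourceName))) {
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        line = line.trim();
                        if (!line.isEmpty() && !line.startsWith("#")) {
                            out.add(line);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> found = entries(HADOOP_FS_SERVICES);
        found.forEach(System.out::println);
        System.out.println("hdfs implementation registered: "
                + found.contains("org.apache.hadoop.hdfs.DistributedFileSystem"));
    }
}
```

If the hdfs entry is missing only in the exported jar, two things that are often suggested are merging the service files at build time (for example with the Maven Shade plugin's ServicesResourceTransformer) or setting fs.hdfs.impl explicitly on the Hadoop configuration that Spark uses.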
