I am trying to access ADLS Gen2 from Spark (Java) with the following configuration properties:
fs.azure.account.auth.type
fs.azure.account.oauth.provider.type
fs.azure.account.oauth2.client.endpoint
fs.azure.account.oauth2.client.id
fs.azure.account.oauth2.client.secret
I have created the blob container and uploaded a file, e.g. https://devbdstreamsv2.dfs.core.windows.net/gen2container/adlsgen2/flat.json, using Azure Storage Explorer version 1.9. I am trying to access the abfs file path using the format mentioned in the documentation: abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/
But we are not initializing the abfs file path anywhere in the runner code, so I am getting the exception "No FileSystem for scheme: abfs". How can I resolve this issue? I want to know how to initialize the abfs filesystem for ADLS Gen2 using Spark (Java).
You need a distribution of Spark which has the abfs connector in the hadoop-azure JAR. The hadoop-2.7.x JARs in the normal ASF releases do not have it, as abfs came out later (Hadoop 3.2+).
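If it helps, here is a minimal sketch of how those properties are usually wired up in Spark Java once a Hadoop 3.2+ build with hadoop-azure is on the classpath. The tenant id, client id and secret are placeholders; the container/account path is taken from the question.

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AbfsReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("adls-gen2-read").getOrCreate();

        // The abfs/abfss scheme is resolved from the Hadoop configuration; there is no
        // separate "initialization" call. Credentials below are placeholders.
        Configuration hc = spark.sparkContext().hadoopConfiguration();
        hc.set("fs.azure.account.auth.type", "OAuth");
        hc.set("fs.azure.account.oauth.provider.type",
                "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider");
        hc.set("fs.azure.account.oauth2.client.endpoint",
                "https://login.microsoftonline.com/<tenant-id>/oauth2/token");
        hc.set("fs.azure.account.oauth2.client.id", "<client-id>");
        hc.set("fs.azure.account.oauth2.client.secret", "<client-secret>");

        // abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
        Dataset<Row> df = spark.read().json(
                "abfss://gen2container@devbdstreamsv2.dfs.core.windows.net/adlsgen2/flat.json");
        df.show();
    }
}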
I am unable to access files in the distributed cache in Hadoop 2.6. Below is a code snippet. I am attempting to place a file, pattern.properties (passed as args[0]), in the YARN distributed cache:
Configuration conf1 = new Configuration();
Job job = Job.getInstance(conf1);
DistributedCache.addCacheFile(new URI(args[0]), conf1);
Also, I am trying to access the file in the cache using the code below:
Context context =null;
URI[] cacheFiles = context.getCacheFiles(); //Error at this line
System.out.println(cacheFiles);
But I am getting the below error at the line mentioned above:
java.lang.NullPointerException
I am not using a Mapper class; it's just Spark Streaming code that accesses a file on the cluster. I want the file to be distributed across the cluster, but I can't take it from HDFS.
I don't know whether I understood your question correctly.
We had some local files which we needed to access in Spark Streaming jobs.
We used this option:
time spark-submit --files
/user/dirLoc/log4j.properties#log4j.properties 'rest other options'
Another way we tried was SparkContext.addFile().
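For what it's worth, here is a rough Java sketch of the SparkContext.addFile() route; the file path is only an example. addFile() ships the file to every node, and SparkFiles.get() resolves the local copy by file name.

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class AddFileSketch {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("distribute-local-file"));

        // Ship a local file to every node; the path below is only an example.
        jsc.addFile("/user/dirLoc/pattern.properties");

        // Anywhere in the job (e.g. inside a map function), resolve the local copy by name.
        String localPath = SparkFiles.get("pattern.properties");
        System.out.println("Cached copy available at: " + localPath);

        jsc.stop();
    }
}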
We're running a Spring application within a docker container. Our application can take SVG files and transform them into PDF format to be embedded within a PDF.
The application works correctly on OS X and transcodes as expected. However, when run inside a Docker container, which has a different file system, the transcoder gets stuck and thrashes the CPU in what looks like a recursive file-searching loop.
java.lang.Thread.State: RUNNABLE
at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
at java.io.File.isFile(File.java:882)
at org.apache.commons.io.filefilter.FileFileFilter.accept(FileFileFilter.java:59)
at org.apache.commons.io.filefilter.AndFileFilter.accept(AndFileFilter.java:122)
at org.apache.commons.io.filefilter.AndFileFilter.accept(AndFileFilter.java:122)
at org.apache.commons.io.filefilter.OrFileFilter.accept(OrFileFilter.java:118)
at java.io.File.listFiles(File.java:1291)
at org.apache.commons.io.DirectoryWalker.walk(DirectoryWalker.java:357)
at org.apache.commons.io.DirectoryWalker.walk(DirectoryWalker.java:364)
at org.apache.commons.io.DirectoryWalker.walk(DirectoryWalker.java:364)
at org.apache.commons.io.DirectoryWalker.walk(DirectoryWalker.java:364)
at org.apache.commons.io.DirectoryWalker.walk(DirectoryWalker.java:364)
at org.apache.commons.io.DirectoryWalker.walk(DirectoryWalker.java:364)
at org.apache.commons.io.DirectoryWalker.walk(DirectoryWalker.java:364)
at org.apache.commons.io.DirectoryWalker.walk(DirectoryWalker.java:364)
at org.apache.commons.io.DirectoryWalker.walk(DirectoryWalker.java:364)
at org.apache.commons.io.DirectoryWalker.walk(DirectoryWalker.java:364)
at org.apache.commons.io.DirectoryWalker.walk(DirectoryWalker.java:364)
at org.apache.commons.io.DirectoryWalker.walk(DirectoryWalker.java:364)
Above is the stack trace of a thread that ran the PDFTranscoder. walk is called recursively for a while, and then eventually getBooleanAttributes0 is called and everything blocks.
After some further research, we took a closer look at what is happening with the strace command and saw that the system is essentially spamming the following in an endless loop:
stat("/./sys/devices/pci0000:00/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/PNP0103:00/subsystem/devices/pcspkr/input/input1/subsystem/input0/subsystem/input0/uniq", {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0 <0.000224>
We seem to be getting blocked or hanging in the stat call. But we've delved so deep into system calls now that it's proving hard to debug. Does anyone have any ideas?
I was getting the same error. After trying many things to fix it, I came to the conclusion that the issue is that fonts are available on Mac OS X, while the (headless) Docker container OS has no fonts, and the transcoder does not fail gracefully while searching for fonts all over the place. I solved it by forcing the transcoder to use the default fonts (and not to automatically look for other fonts) like this:
...
PDFTranscoder transcoder = new PDFTranscoder();
transcoder.addTranscodingHint(PDFTranscoder.KEY_AUTO_FONTS, false);
...
transcoder.transcode(transcoderInput, transcoderOutput);
...
Note this has the downside, of course, of falling back to the fonts it knows when it encounters a font outside the standard 14 PDF fonts. I tried a few things to fix that, but so far no luck.
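For reference, a more complete, self-contained version of this workaround might look like the following (the input/output file names are made up):

import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.batik.transcoder.TranscoderInput;
import org.apache.batik.transcoder.TranscoderOutput;
import org.apache.fop.svg.PDFTranscoder;

public class SvgToPdf {
    public static void main(String[] args) throws Exception {
        PDFTranscoder transcoder = new PDFTranscoder();
        // Stick to the built-in fonts and skip the filesystem font scan entirely.
        transcoder.addTranscodingHint(PDFTranscoder.KEY_AUTO_FONTS, false);

        try (FileInputStream in = new FileInputStream("input.svg");
             FileOutputStream out = new FileOutputStream("output.pdf")) {
            transcoder.transcode(new TranscoderInput(in), new TranscoderOutput(out));
        }
    }
}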
I hope this helps someone.
I had the same issue and solved it in my case. This thread helped a lot. Now I'd like to put all the parts together, maybe also for other people who come across this.
The reason for this is the directory in which you start your Java application. I noticed that this problem occurs under the following circumstances:
The Java application was started in the filesystem root.
Auto scanning for fonts is enabled in Apache FOP.
I found a similar post, Infinite scan for fonts in Apache FOP on CentOS. The explanation by Fyodor Sherstobitov sounds plausible.
Apache FOP uses the working directory of your Java application to scan for fonts. In this case this is the filesystem root. Therefore the whole filesystem will be scanned.
The following code is copied from PDFDocumentGraphics2DConfigurator. It shows that new File(".").getAbsoluteFile().toURI() is used, which is the working directory, i.e. the directory in which the Java application was started.
/**
 * Creates the {@link FontInfo} instance for the given configuration.
 * @param cfg the configuration
 * @param useComplexScriptFeatures true if complex script features enabled
 * @return the font collection
 * @throws FOPException if an error occurs while setting up the fonts
 */
public static FontInfo createFontInfo(Configuration cfg, boolean useComplexScriptFeatures)
        throws FOPException {
    FontInfo fontInfo = new FontInfo();
    final boolean strict = false;
    if (cfg != null) {
        URI thisUri = new File(".").getAbsoluteFile().toURI();
        InternalResourceResolver resourceResolver
            = ResourceResolverFactory.createDefaultInternalResourceResolver(thisUri);
        //TODO The following could be optimized by retaining the FontManager somewhere
        FontManager fontManager = new FontManager(resourceResolver, FontDetectorFactory.createDefault(),
                FontCacheManagerFactory.createDefault());
        //TODO Make use of fontBaseURL, font substitution and referencing configuration
        //Requires a change to the expected configuration layout
        DefaultFontConfig.DefaultFontConfigParser parser
            = new DefaultFontConfig.DefaultFontConfigParser();
        DefaultFontConfig fontInfoConfig = parser.parse(cfg, strict);
        DefaultFontConfigurator fontInfoConfigurator
            = new DefaultFontConfigurator(fontManager, null, strict);
        List<EmbedFontInfo> fontInfoList = fontInfoConfigurator.configure(fontInfoConfig);
        fontManager.saveCache();
        FontSetup.setup(fontInfo, fontInfoList, resourceResolver, useComplexScriptFeatures);
    } else {
        FontSetup.setup(fontInfo, useComplexScriptFeatures);
    }
    return fontInfo;
}
You can solve this in two ways:
Disable auto scanning for fonts in Apache FOP as Bob Schultz mentioned. If you do that, you will have to configure the fonts for Apache FOP manually.
Don't start the Java application in the filesystem root as snyman mentioned. In this case you can continue using the auto scanning for fonts.
Disable Auto Scanning
This is a snippet of the code that configures Apache FOP with a config file. If you don't enable auto scan in that file, you don't have to disable it programmatically.
// Load configuration for manually configuring fonts
DefaultConfigurationBuilder cfgBuilder = new DefaultConfigurationBuilder();
Configuration cfg = cfgBuilder.build(ResourceUtil.getResourceStream("path/to/config"));
PDFTranscoder transcoder = new PDFTranscoder();
transcoder.configure(cfg);
// Disable auto scanning for fonts programmatically - not necessary if you
// don't enable auto scan in your config file
// transcoder.addTranscodingHint(PDFTranscoder.KEY_AUTO_FONTS, false);
Start the Application in a Separate Folder
By specifying a WORKDIR, everything happens in that folder. The auto scan runs there and finishes quickly.
FROM openjdk:8-jre-alpine
WORKDIR /app
ARG JAR_FILE=target/myapp-0.0.1-SNAPSHOT.jar
COPY ${JAR_FILE} app.jar
...
ENTRYPOINT ["java","-jar","app.jar"]
I had the same issue.
I solved it by setting WORKDIR in the Dockerfile.
I set it to my deployment directory, where I copy the Spring JAR file, i.e.:
WORKDIR ${DEPLOYMENT_DIR}
Using the latest Batik libraries in the POM:
<dependency>
    <groupId>org.apache.xmlgraphics</groupId>
    <artifactId>batik-all</artifactId>
    <version>1.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.xmlgraphics</groupId>
    <artifactId>fop</artifactId>
    <version>2.2</version>
</dependency>
I had the same problem in my project. I solved it by downgrading Batik to version 1.7.
I hope this will work for you.
Try adding the parameter '-Duser.dir=/%CATALINA_HOME/' to your CATALINA_OPTS. I encountered the same issue on my CentOS server.
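If you're on Tomcat, one place that parameter could live is bin/setenv.sh; the exact path and value below are assumptions, not something from the original answer:

# $CATALINA_HOME/bin/setenv.sh (assumed location)
# Give the JVM a small, safe working directory so any font scan does not start at /
export CATALINA_OPTS="$CATALINA_OPTS -Duser.dir=$CATALINA_HOME"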
So I've installed the Hadoop file system on my machine, and I'm using a Maven dependency to provide the Spark environment for my code (spark-mllib_2.10).
Now, my code uses Spark MLlib and accesses data from the Hadoop file system with this code:
String finalData = ProjectProperties.hadoopBasePath + ProjectProperties.finalDataPath;
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), finalData).toJavaRDD();
With the following properties set:
finalDataPath = /data/finalInput.txt
hadoopBasePath = hdfs://127.0.0.1:54310
I am starting the DFS nodes externally with the command:
start-dfs.sh
Now, my code works perfectly fine when run from Eclipse, but if I export the whole project as an executable JAR, it gives me the following exception:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
I also checked various solutions posted online for this issue, where people suggest adding the following:
hadoopConfig.set("fs.hdfs.impl",
org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()
);
hadoopConfig.set("fs.file.impl",
org.apache.hadoop.fs.LocalFileSystem.class.getName()
);
OR
<property>
    <name>fs.file.impl</name>
    <value>org.apache.hadoop.fs.LocalFileSystem</value>
    <description>The FileSystem for file: uris.</description>
</property>
<property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
    <description>The FileSystem for hdfs: uris.</description>
</property>
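(For reference, in a Spark job those suggested settings would presumably be applied to the Hadoop configuration that the context already carries, something like the sketch below, where jsc is the JavaSparkContext from my code above.)

// Sketch only: wire the suggested implementations into Spark's Hadoop configuration.
org.apache.hadoop.conf.Configuration hadoopConfig = jsc.hadoopConfiguration();
hadoopConfig.set("fs.hdfs.impl",
        org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
hadoopConfig.set("fs.file.impl",
        org.apache.hadoop.fs.LocalFileSystem.class.getName());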
But I don't use any Hadoop context or Hadoop configuration in my project; I simply load the data from Hadoop using the URL.
Can someone suggest a solution relevant to this issue?
Please note that this works fine from Eclipse and only fails when I export the same project as an executable JAR.
Update
As suggested in the comments and in the solutions found online, I tried two things.
Added dependencies to my pom.xml for the hadoop-core, hadoop-hdfs and hadoop-client libraries.
Added the above property configuration to Hadoop's core-site.xml, as suggested here: http://grokbase.com/t/cloudera/scm-users/1288xszz7r/no-filesystem-for-scheme-hdfs
But still no luck getting the error resolved. It gives the same issue locally on my machine as well as on one of the remote machines I tried it on.
I also installed Hadoop there the same way I did on my machine, using the link mentioned above.
I was trying to create a Java program to write/read files from HDFS.
I saw some examples of the Java API, and with those the following code works for me:
Configuration mConfiguration = new Configuration();
mConfiguration.set("fs.default.name", "hdfs://NAME_NODE_IP:9000");
But my setup has to be changed to a Hadoop HA setup, so hardcoded namenode addressing is not possible.
I saw some examples where the paths of the configuration XMLs are provided, like below:
mConfiguration.addResource(new Path("/usr/local/hadoop/etc/hadoop/core-site.xml"));
mConfiguration.addResource(new Path("/usr/local/hadoop/etc/hadoop/hdfs-site.xml"));
This code also works when the application runs on the same system as Hadoop.
But it does not work when my application is not running on the same machine as Hadoop.
So, what approach should I take so that the system works without direct namenode addressing?
Any help would be appreciated.
When using Hadoop High Availability, you need to set the following properties on the Configuration object:
Configuration conf = new Configuration(false);
conf.set("fs.defaultFS", "hdfs://nameservice1");
conf.set("fs.default.name", conf.get("fs.defaultFS"));
conf.set("dfs.nameservices","nameservice1");
conf.set("dfs.ha.namenodes.nameservice1", "namenode1,namenode2");
conf.set("dfs.namenode.rpc-address.nameservice1.namenode1","hadoopnamenode01:8020");
conf.set("dfs.namenode.rpc-address.nameservice1.namenode2", "hadoopnamenode02:8020");
conf.set("dfs.client.failover.proxy.provider.nameservice1","org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
Try it out!