How to access HBase on S3 from a non-EMR node - Java

I am trying to access HBase on EMR for reads and writes from a Java application that runs outside the EMR cluster nodes, i.e. from a Docker application running on an ECS cluster/EC2 instance. The HBase root folder is of the form s3://<bucketname>/. I need to get Hadoop and HBase configuration objects to read and write the HBase data using the core-site.xml and hbase-site.xml files. I am able to do this when the HBase data is stored in HDFS.
But when HBase is on S3 and I try the same thing, I get the exception below.
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)
The core-site.xml file contains the properties below.
<property>
<name>fs.s3.impl</name>
<value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
<property>
<name>fs.s3n.impl</name>
<value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
Below is the jar containing the “com.amazon.ws.emr.hadoop.fs.EmrFileSystem” class:
/usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.44.0.jar
This jar is present only on EMR nodes and is not available as a Maven dependency from the public Maven repositories. For MapReduce and Spark jobs, adding the jar's location to the classpath serves the purpose. For a Java application running outside the EMR cluster nodes, adding the jar to the classpath won't work, because the jar is not available on the ECS instances. Manually copying the jar onto the classpath leads to the errors below.
2021-03-26 10:02:39.420 INFO 1 --- [ main] c.a.ws.emr.hadoop.fs.util.PlatformInfo : Unable to read clusterId from http://localhost:8321/configuration , trying extra instance data file: /var/lib/instance-controller/extraInstanceData.json
2021-03-26 10:02:39.421 INFO 1 --- [ main] c.a.ws.emr.hadoop.fs.util.PlatformInfo : Unable to read clusterId from /var/lib/instance-controller/extraInstanceData.json, trying EMR job-flow data file: /var/lib/info/job-flow.json
2021-03-26 10:02:39.421 INFO 1 --- [ main] c.a.ws.emr.hadoop.fs.util.PlatformInfo : Unable to read clusterId from /var/lib/info/job-flow.json, out of places to look
2021-03-26 10:02:45.578 WARN 1 --- [ main] c.a.w.e.h.fs.util.ConfigurationUtils : Cannot create temp dir with proper permission: /mnt/s3
We are using EMR version 5.29. Is there any workaround for this issue?

S3 isn't a "real" filesystem: it doesn't have two things HBase needs:
atomic renames, needed for compaction
hsync(), to flush/sync the write-ahead log.
To use S3 as the HBase back end:
There's a filesystem wrapper around S3A, "HBoss", which does the locking needed for compaction.
You MUST still use HDFS or some other real FS for the WAL.
Further reading: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md
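As an illustration only (the host name and WAL path here are assumptions, not values from the original setup), that split would look roughly like this in hbase-site.xml, with the root directory on S3 via s3a and the write-ahead log kept on HDFS through the hbase.wal.dir property:
<property>
<name>hbase.rootdir</name>
<value>s3a://bucketname/</value>
</property>
<property>
<name>hbase.wal.dir</name>
<value>hdfs://namenode-host:8020/user/hbase/WAL</value>
</property>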

I was able to solve the issue by using s3a. The EMRFS libraries used on EMR are not public and cannot be used outside EMR, so I used S3AFileSystem to access HBase on S3 from my ECS cluster. Add the hadoop-aws and aws-java-sdk-bundle Maven dependencies corresponding to your Hadoop version.
Then add the below property to core-site.xml:
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
<description>The implementation class of the S3A Filesystem</description>
</property>
Then change the HBase root directory URL in hbase-site.xml as follows:
<property>
<name>hbase.rootdir</name>
<value>s3a://bucketname/</value>
</property>
You can also set the other s3a-related properties. Please refer to the link below for more details on s3a:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
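To make the client side concrete, here is a minimal Java sketch of how those files can be loaded and a connection opened; the resource paths, the ZooKeeper quorum host and the table name are placeholders I'm assuming, not values from the original setup:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseOnS3Client {
    public static void main(String[] args) throws Exception {
        // Load the cluster configuration files shipped with the application
        Configuration conf = HBaseConfiguration.create();
        conf.addResource(new Path("/etc/myapp/core-site.xml"));   // contains fs.s3a.impl
        conf.addResource(new Path("/etc/myapp/hbase-site.xml"));  // contains hbase.rootdir=s3a://bucketname/

        // Placeholder: the EMR master (HBase ZooKeeper quorum) must be reachable from the ECS task
        conf.set("hbase.zookeeper.quorum", "emr-master-hostname");

        // s3a credentials are resolved through the default provider chain (env vars, instance/task role, ...)
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {  // hypothetical table
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            System.out.println("Row found: " + !result.isEmpty());
        }
    }
}
The key point is that the s3a:// scheme in hbase.rootdir must resolve to a filesystem class (S3AFileSystem) that is actually on the client's classpath, which is exactly what the hadoop-aws dependency provides.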

Related

Problems reading from EMR cluster in S3

I am developing an application in Java Spark. I generated and successfully loaded the .jar onto the EMR cluster. There is one line of code that reads:
JsonReader jsonReader = new JsonReader(new FileReader("s3://naturgy-sabt-dev/QUERY/input.json"));
I am 100% sure of:
The file does exist.
When executing aws s3 cp s3://naturgy-sabt-dev/QUERY/input.json ., I correctly receive the .json file.
IAM policies are set so that the tied EMR role has permissions to read, write and list.
This post about how to read from S3 in EMR does not help.
When submitting the spark jar, I am getting the following error:
(Note the printout of the path that is going to be read, right before calling the Java statement shown above.)
...
...
...
19/12/11 15:55:46 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 172.31.36.11, 35744, None)
19/12/11 15:55:46 INFO BlockManager: external shuffle service port = 7337
19/12/11 15:55:46 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 172.31.36.11, 35744, None)
19/12/11 15:55:48 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/local-1576079746613
19/12/11 15:55:48 INFO SharedState: Warehouse path is 'hdfs:///user/spark/warehouse'.
#########################################
I am going to read from s3://naturgy-sabt-dev/QUERY/input.json
#########################################
java.io.FileNotFoundException: s3:/naturgy-sabt-dev/QUERY/input.json (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at java.io.FileReader.<init>(FileReader.java:58)
...
...
...
Does anyone know what's going on?
Thanks for any help you can provide.
Java's default FileReader cannot load files from AWS S3.
S3 objects can only be read with third-party libraries. The bare S3 reader is shipped within the AWS SDK for Java.
However, Hadoop also has libraries to read from S3, and the Hadoop jars are preinstalled on an AWS EMR Spark cluster (actually on almost all Spark installs).
Spark supports loading data from the S3 filesystem into a Spark DataFrame directly, without any manual steps. All readers can read either one file, or multiple files with the same structure via a glob pattern. The JSON DataFrame reader expects newline-delimited JSON by default; this can be configured.
Various ways to use it:
# read a single newline-delimited json file, where each line is a record
spark.read.json("s3://path/input.json")
# read a single serialized json object or array spanning multiple lines
spark.read.option("multiLine", true).json("s3://path/input.json")
# read multiple json files
spark.read.json("s3://folder/*.json")

Flyway (Spring Boot) migration with files in multiple directories skips a version

I am working on a Spring Boot application with Flyway. I have to update a database that already has these migrations:
The migrations under common must be executed in every environment (Spring profiles loaded), while local and qa have different data inserted into an H2 database.
I need to alter the table (adding and modifying columns) and then update the data inserted in V1_1 and V1_2. I tried MANY different approaches to avoid putting the ALTER TABLE SQL command in the local and qa migration files. I would like to leave the ALTER TABLE commands in the common folder and have only the update commands in the local and qa folders. But all of my attempts were in vain; the new migration I add in the local directory always gets executed before the one I add in the common directory:
Even with the naming scheme above, V1_4 gets executed before V1_3, causing an error because the new columns have not been added yet. I know this is not the perfect naming scheme; I used it mostly for testing and to illustrate my point. But even while testing manually, Flyway does not behave as I would expect (surely because of a misunderstanding on my part). The app log clearly shows V1_3 not being executed:
2019-08-13 13:31:04.025 INFO 26508 --- [ main] o.f.core.internal.command.DbMigrate : Migrating schema "PUBLIC" to version 1.0 - schema
2019-08-13 13:31:04.076 INFO 26508 --- [ main] o.f.core.internal.command.DbMigrate : Migrating schema "PUBLIC" to version 1.1 - institutions
2019-08-13 13:31:04.092 INFO 26508 --- [ main] o.f.core.internal.command.DbMigrate : Migrating schema "PUBLIC" to version 1.2 - data
2019-08-13 13:31:04.476 INFO 26508 --- [ main] o.f.core.internal.command.DbMigrate : Migrating schema "PUBLIC" to version 1.4 - update data
2019-08-13 13:31:04.482 ERROR 26508 --- [ main] o.f.core.internal.command.DbMigrate : Migration of schema "PUBLIC" to version 1.4 - update data failed! Please restore backups and roll back database and code!
I am using this property: spring.flyway.locations=classpath:db/migration/common,classpath:db/migration/local
in the environment where the exception occurs.
What am I doing wrong? I can't seem to find much documentation on Flyway migrations with files in multiple directories. Unfortunately, this is the structure I am stuck with and I cannot change it, since these decisions are out of my hands.
Thanks in advance!
When you provide a location for your migrations, Flyway looks for .sql files under that folder and its subfolders. So if you have V1.1 and V1.4 in local and V1.3 in common, Flyway will treat local as the migration folder and execute V1.1 and V1.4 in that order; it won't go to common unless you provide the root dir as your Flyway location. In your case you should give db.migration as your location.
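In concrete terms, that suggestion amounts to pointing Flyway at the common parent folder and letting it discover the subfolders, roughly like this (a sketch based on the property shown in the question):
# single root location: migrations in all subfolders are discovered and ordered by version
spring.flyway.locations=classpath:db/migration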

Hive: Unable to create external tables for existing data in HDFS

Update 1:
I changed the Hadoop version to 2.x but the error is still there.
Original:
I generated tpcds test data into Ceph with hive-testbench.
Currently, the data is located at the root directory of the storage system, in the folder tpcds.
For example, the result of hdfs dfs -ls / is
drwxrwxrwx - root root 0 2019-08-05 10:44 /hive
drwxrwxrwx - root root 0 2019-08-05 10:44 /tmp
drwxrwxrwx - root root 0 2019-08-05 10:44 /tpcds
drwxrwxrwx - root root 0 2019-08-05 10:44 /user
drwxrwxrwx - root root 0 2019-08-05 10:44 /warehouse
The result of s3cmd ls s3://tpcds is:
DIR s3://tpcds/hive/
DIR s3://tpcds/tmp/
DIR s3://tpcds/tpcds/
DIR s3://tpcds/user/
DIR s3://tpcds/warehouse/
For s3cmd ls s3://tpcds, the bucket name is tpcds.
When the data is ready, the next step is to create external tables in Hive to get access to it. The reason I show the storage layout is to assure you that the issue I met has nothing to do with the path.
The command used is hive -i settings/load-flat.sql -f ddl-tpcds/text/alltables.sql -d DB=tpcds_text_7 -d LOCATION=tpcds/7; however, I ran into the issue below:
exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Exception thrown flushing changes to datastore)
at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:433)
at org.apache.hadoop.hive.ql.exec.DDLTask.createDatabase(DDLTask.java:4243)
For the stack versions: Hive 2.3.2, Hadoop 3.1.2.
Currently, the most likely cause from my side is the Hadoop version; I'm going to downgrade it to Hadoop 2.7 to see if the same error occurs.
In the meantime, any comment is welcome. Thanks for your help in advance.
Since the issue is solved, I'd like to post the solution here for future visitors who might run into the same issue.
The Hive version I had used to initialize the schema of the MySQL metastore was 3.1.1. After that, I simply replaced the Hive folder with Hive 2.3.2. This kind of downgrade is not graceful: the metastore created before was not consistent with Hive 2.3.2, and that is why I hit the issue.
I reverted the Hive folder to 3.1.1 and everything worked fine again.
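For anyone hitting a similar version mismatch, Hive ships a schematool that can report and upgrade the metastore schema; a rough sketch, assuming a MySQL metastore as in the answer above:
# show the schema version recorded in the metastore vs. the one this Hive expects
$HIVE_HOME/bin/schematool -dbType mysql -info
# initialize a fresh metastore schema for the running Hive version
$HIVE_HOME/bin/schematool -dbType mysql -initSchema
# or upgrade an existing, older metastore schema to the running Hive version
$HIVE_HOME/bin/schematool -dbType mysql -upgradeSchema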

NodeManager and ResourceManager processes do not start

I am setting up a multi-node cluster and my NodeManager and ResourceManager processes are not starting for some reason, and I can't figure out why. When I run the jps command, I only see the NameNode, SecondaryNameNode and Jps processes. As a result, my MapReduce job won't work. This is my configuration:
yarn-site.xml - across NameNode and DataNodes
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ec2PathToMyNameNode.compute-1.amazonaws.com</value>
</property>
</configuration>
And my hosts file is this on the NameNode:
nameNodeIP nameNodePublicDNS.compute-1.amazonaws.com
dataNode1IP dataNode1PublicDNS.compute-1.amazonaws.com
dataNode2IP dataNode2PublicDNS.compute-1.amazonaws.com
dataNode3IP dataNode3PublicDNS.compute-1.amazonaws.com
127.0.0.1 localhost
When I run my MapReduce job it says it's unable to connect on port 8032. I am using Hadoop 3.1.2.
Edit:
I checked the logs and found the following exception:
Caused by: java.lang.ClassNotFoundException: javax.activation.DataSource
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:190)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:499)
... 83 more
Error injecting constructor, java.lang.NoClassDefFoundError: javax/activation/DataSource
at org.apache.hadoop.yarn.server.resourcemanager.webapp.JAXBContextResolver.<init>(JAXBContextResolver.java:41)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebApp.setup(RMWebApp.java:54)
while locating org.apache.hadoop.yarn.server.resourcemanager.webapp.JAXBContextResolver
1 error
at com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1025)
at com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1051)
at com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory$GuiceInstantiatedComponentProvider.getInstance(GuiceComponentProviderFactory.java:345)
Trying to figure out the issue
(1) start-dfs.sh vs start-all.sh
Check that you are using the start-all.sh command when you are trying to start Hadoop, because start-dfs.sh will only start the NameNode and DataNodes.
(2) Check the Hadoop logs
Check the HADOOP_LOG_DIR environment variable value to find the log dir; the logs will include any exception thrown when trying to start the NodeManager and the ResourceManager.
(3) Check the installed Java version
The error may be caused by an incompatible Java version; check that you have installed a Java version supported by your Hadoop release.
Fix Java 9 incompatibilies in Hadoop
Hadoop Error starting ResourceManager and NodeManager
(4) Check Hadoop common issues
Based on the error you provided in your update, you may find these issue links relevant:
[JDK9] Fail to run yarn application after building hadoop pkg with jdk9 in jdk9 env
[JDK9] Resource Manager failed to start after using hadoop pkg(built with jdk9)
More information
For more information you can check my article on Medium, it may give you some insights:
Installing Hadoop 3.1.0 multi-node cluster on Ubuntu 16.04 Step by Step
My problem was that I used Java 11 with Hadoop.
So what I did was:
1. rm /Library/Java/*
2. download Java 8 from https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
3. install the Java 8 JDK
4. fix JAVA_HOME in hadoop-env.sh (see the sketch below)
5. stop-all.sh
6. start-dfs.sh
7. start-yarn.sh
[pdash#localhost hadoop]$ export YARN_RESOURCEMANAGER_OPTS="--add-modules=ALL-SYSTEM"
[pdash#localhost hadoop]$ export YARN_NODEMANAGER_OPTS="--add-modules=ALL-SYSTEM"
It will work for sure; I tried it based on the Apache JIRA log. Thanks, PRAFUL.
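For step 4 in the list above, hadoop-env.sh only needs JAVA_HOME pointed at the Java 8 installation. A minimal sketch; the install path below is an assumption and differs per OS and distribution:
# in $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# point Hadoop at a Java 8 JDK instead of Java 11 (assumed path, adjust to your system)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64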

Initialization failed for Block pool <registering> (Datanode Uuid unassigned)

What is the source of this error and how could it be fixed?
2015-11-29 19:40:04,670 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to anmol-vm1-new/10.0.1.190:8020. Exiting.
java.io.IOException: All specified directories are not accessible or do not exist.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:217)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:254)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:974)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:945)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:278)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:220)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
at java.lang.Thread.run(Thread.java:745)
2015-11-29 19:40:04,670 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to anmol-vm1-new/10.0.1.190:8020
2015-11-29 19:40:04,771 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool <registering> (Datanode Uuid unassigned)
There are 2 possible solutions:
First:
Your NameNode and DataNode cluster IDs do not match; make sure to make them the same.
On the name node, change your cluster ID in the file located at:
$ nano HADOOP_FILE_SYSTEM/namenode/current/VERSION
On the data node, the cluster ID is stored in the file:
$ nano HADOOP_FILE_SYSTEM/datanode/current/VERSION
Second:
Format the namenode altogether:
Hadoop 1.x: $ hadoop namenode -format
Hadoop 2.x: $ hdfs namenode -format
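Before editing anything, a quick way to compare the two IDs; a sketch assuming the same HADOOP_FILE_SYSTEM layout used above:
# print the clusterID recorded on the namenode and on the datanode
grep clusterID HADOOP_FILE_SYSTEM/namenode/current/VERSION
grep clusterID HADOOP_FILE_SYSTEM/datanode/current/VERSION
# if they differ, copy the namenode's clusterID into the datanode's VERSION file
# and restart the datanode service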
I met the same problem and solved it by doing the following steps:
Step 1. Remove the HDFS directory (for me it was the default directory "/tmp/hadoop-root/"):
rm -rf /tmp/hadoop-root/*
Step 2. Run
bin/hdfs namenode -format
to format the directory.
The root cause of this is that the DataNode and NameNode clusterIDs are different; unify them with the NameNode's clusterID, then restart Hadoop, and it should be resolved.
The issue arises because of a mismatch of the cluster IDs of the datanode and namenode.
Follow these steps:
Go to Hadoop_home/data/namenode/CURRENT and copy the cluster ID from "VERSION".
Go to Hadoop_home/data/datanode/CURRENT and paste this cluster ID into "VERSION", replacing the one present there.
Then format the namenode.
Start the datanode and namenode again.
The issue arises because of a mismatch of the cluster IDs of the datanode and namenode.
Follow these steps:
1- Go to Hadoop_home/ and delete the Data folder.
2- Create a folder with another name, e.g. data123.
3- Create two folders inside it: namenode and datanode.
4- Go to hdfs-site.xml and put your paths there:
<property>
<name>dfs.namenode.name.dir</name>
<value>........../data123/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>............../data123/datanode</value>
</property>
This problem may also occur when there are storage I/O errors. In that scenario, the VERSION file is not available, which is why the error above appears.
You may need to exclude the storage locations on those bad drives in hdfs-site.xml.
For me, this worked:
Delete (or make a backup of) the HADOOP_FILE_SYSTEM/namenode/current directory.
Restart the datanode service.
This should create the current directory again, with the correct clusterID in the VERSION file.
Source: https://community.pivotal.io/s/article/Cluster-Id-is-incompatible-error-reported-when-starting-datanode-service?language=en_US
