I am developing an application in Java on Spark. I generated the .jar and successfully uploaded it to the EMR cluster. There is one line of code that reads:
JsonReader jsonReader = new JsonReader(new FileReader("s3://naturgy-sabt-dev/QUERY/input.json"));
I am 100% sure of:
Such file does exist.
When executing aws s3 cp s3://naturgy-sabt-dev/QUERY/input.json ., I receive the .json file correctly.
IAM policies are set so that the attached EMR role has permissions to read, write, and list.
This post about how to read from S3 in EMR does not help.
When submitting the spark jar, I am getting the following error:
(Note that the path about to be read is printed right before calling the Java statement above.)
...
...
...
19/12/11 15:55:46 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 172.31.36.11, 35744, None)
19/12/11 15:55:46 INFO BlockManager: external shuffle service port = 7337
19/12/11 15:55:46 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 172.31.36.11, 35744, None)
19/12/11 15:55:48 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/local-1576079746613
19/12/11 15:55:48 INFO SharedState: Warehouse path is 'hdfs:///user/spark/warehouse'.
#########################################
I am going to read from s3://naturgy-sabt-dev/QUERY/input.json
#########################################
java.io.FileNotFoundException: s3:/naturgy-sabt-dev/QUERY/input.json (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at java.io.FileReader.<init>(FileReader.java:58)
...
...
...
Does anyone know what's going on?
Thanks for any help you can provide.
Java's default FileReader cannot load files from AWS S3.
S3 objects can only be read with third-party libraries. A bare S3 client ships with the AWS SDK for Java.
However, Hadoop also has libraries to read from S3, and the Hadoop jars are preinstalled on AWS EMR Spark clusters (in fact, on almost all Spark installs).
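For instance, here is a minimal sketch using the Hadoop FileSystem API, reusing the path from the question (on EMR the s3:// scheme is backed by the preinstalled EMRFS, so no extra dependencies are needed on the cluster):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import com.google.gson.stream.JsonReader; // assuming Gson's JsonReader, as in the question
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Resolve the filesystem from the s3:// URI instead of the local disk.
Path path = new Path("s3://naturgy-sabt-dev/QUERY/input.json");
FileSystem fs = FileSystem.get(URI.create(path.toString()), new Configuration());
try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
    // fs.open returns an input stream, which can be handed to JsonReader
    // in place of the FileReader from the question.
    JsonReader jsonReader = new JsonReader(reader);
    // ... parse as before ...
}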
Spark also supports loading data from the S3 filesystem into a Spark dataframe directly, without any manual steps. All readers can read either one file or multiple files with the same structure via a glob pattern. The JSON dataframe reader expects newline-delimited JSON by default; this can be configured.
Various usage examples:
// read a single newline-delimited JSON file, each line is a record
spark.read.json("s3://path/input.json")
// read a single serialized JSON object or array spanning multiple lines
spark.read.option("multiLine", true).json("s3://path/input.json")
// read multiple JSON files
spark.read.json("s3://folder/*.json")
Related
I am trying to access HBase on EMR for read and write from a Java application that is running outside the EMR cluster nodes, i.e. from a Docker application running on an ECS cluster/EC2 instance. The HBase root folder is like s3://<bucketname>/. I need to get Hadoop and HBase configuration objects to access the HBase data for read and write using the core-site.xml and hbase-site.xml files. I am able to do this when the HBase data is stored in HDFS.
But when HBase is on S3 and I try to achieve the same, I get the exception below.
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)
The core-site.xml file contains the below properties:
<property>
<name>fs.s3.impl</name>
<value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
<property>
<name>fs.s3n.impl</name>
<value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
Below is the jar containing the “com.amazon.ws.emr.hadoop.fs.EmrFileSystem” class:
/usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.44.0.jar
This jar is present only on EMR nodes and cannot be included as a Maven dependency in a Java project from the public Maven repository. For MapReduce and Spark jobs, adding the jar location to the classpath serves the purpose. For a Java application running outside the EMR cluster nodes, adding the jar to the classpath won't work, as the jar is not available on the ECS instances. Manually adding the jar to the classpath leads to the error below.
2021-03-26 10:02:39.420 INFO 1 --- [ main] c.a.ws.emr.hadoop.fs.util.PlatformInfo : Unable to read clusterId from http://localhost:8321/configuration , trying extra instance data file: /var/lib/instance-controller/extraInstanceData.json
2021-03-26 10:02:39.421 INFO 1 --- [ main] c.a.ws.emr.hadoop.fs.util.PlatformInfo : Unable to read clusterId from /var/lib/instance-controller/extraInstanceData.json, trying EMR job-flow data file: /var/lib/info/job-flow.json
2021-03-26 10:02:39.421 INFO 1 --- [ main] c.a.ws.emr.hadoop.fs.util.PlatformInfo : Unable to read clusterId from /var/lib/info/job-flow.json, out of places to look
2021-03-26 10:02:45.578 WARN 1 --- [ main] c.a.w.e.h.fs.util.ConfigurationUtils : Cannot create temp dir with proper permission: /mnt/s3
We are using EMR version 5.29. Is there any workaround for this issue?
S3 isn't a "real" filesystem -it doesn't have two things hbase needs
atomic renames needed for compaction
hsync() to flush/sync the write ahead log.
To use S3 as the HBase back end
There's a filesystem wrapper around S3a, "HBoss" which does the locking needed for compaction.
you MUST still use HDFS or some other real FS for the WAL
Further reading: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md
I was able to solve the issue by using s3a. The EMRFS libraries used in EMR are not public and cannot be used outside EMR, so I used S3AFileSystem to access HBase on S3 from my ECS cluster. Add the hadoop-aws and aws-java-sdk-bundle Maven dependencies corresponding to your Hadoop version, as sketched below.
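For example, the dependency declarations might look like the following; the versions are illustrative and must be matched to your cluster (EMR 5.29 ships Hadoop 2.8.x), so check which SDK version your hadoop-aws artifact was built against:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <!-- match your cluster's Hadoop version -->
  <version>2.8.5</version>
</dependency>
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-bundle</artifactId>
  <!-- use the SDK version declared by your hadoop-aws artifact -->
  <version>1.11.375</version>
</dependency>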
Then add the below property to core-site.xml:
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
<description>The implementation class of the S3A Filesystem</description>
</property>
Then change the HBase root directory URL in hbase-site.xml as follows:
<property>
<name>hbase.rootdir</name>
<value>s3a://bucketname/</value>
</property>
You can also set other s3a-related properties. Please refer to the link below for more details on s3a:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
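With those files on the classpath, a minimal client sketch might look like this (the table name and row key are hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// HBaseConfiguration.create() picks up core-site.xml and hbase-site.xml
// from the classpath, including fs.s3a.impl and the s3a:// hbase.rootdir.
Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("my_table"))) { // hypothetical table
    Result result = table.get(new Get(Bytes.toBytes("row1")));           // hypothetical row key
    System.out.println(result);
}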
I've got an old Hadoop system (that hasn't been used for years). When trying to restart the cluster (1 master, 2 slaves, all on Linux), I got an error on the namenode.
Error output:
2021-03-18 20:18:28,628 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: Failed to load image from FSImageFile(file=/home/xxx/tmp/hadoop/name/current/fsimage_0000000000000480607, cpktTxId=0000000000000480607)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:651)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:264)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:627)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:469)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:403)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:437)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:609)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:594)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1169)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1235)
Caused by: java.io.IOException: No MD5 file found corresponding to image file /home/xxx/tmp/hadoop/name/current/fsimage_0000000000000480607
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:736)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:632)
... 9 more
2021-03-18 20:18:28,631 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2021-03-18 20:18:28,633 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
More info:
One of the slaves' datanode partitions has bad disk blocks, so I removed that partition from /etc/fstab in order to bring Linux up. So, that slave's data is lost.
What I have tried:
Starting the cluster with all 3 nodes: got the above error.
Starting the cluster excluding the bad slave (thus only 2 nodes): still got the above error.
Questions:
A. What does the error mean?
B. Is it related to the bad slave?
C. Is there any way to recover without re-formatting the HDFS filesystem on the namenode?
There should be a file called:
/home/xxx/tmp/hadoop/name/current/fsimage_0000000000000480607.md5
in the same location as the image file. It will have contents that look like this:
177e5f4ed0b7f43eb9e274903e069da4 *fsimage_0000000000000014367
Simply get the md5 sum of your fsimage file:
md5sum fsimage_0000000000000480607
Then create a new md5 file, fsimage_0000000000000480607.md5, that looks like:
xxxxxx *fsimage_0000000000000480607
where xxxxxx is the md5 sum printed by the md5sum command.
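Assuming GNU coreutils, the two steps can be combined, since md5sum -b prints exactly the hash *filename format shown above:
# run in /home/xxx/tmp/hadoop/name/current
md5sum -b fsimage_0000000000000480607 > fsimage_0000000000000480607.md5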
I am trying to use the Kafka-Kinesis-Connector, a connector to be used with Kafka Connect to publish messages from Kafka to Amazon Kinesis Firehose, as described in the link (https://github.com/awslabs/kinesis-kafka-connector), and I am getting the error below. I am using Cloudera version CDH-6.1.0-1.cdh6.1.0.p0.770702 and it ships with Kafka 2.1.2 (0.10.0.1+kafka2.1.2+6).
I have loaded the AWS credentials into the current session, but this didn't work:
export AWS_ACCESS_KEY_ID="XXX"
export AWS_SECRET_ACCESS_KEY="YYYYY"
export AWS_DEFAULT_REGION="sssss"
My worker.properties is shown below:
bootstrap.servers=kafkanode:9092
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
#internal.value.converter=org.apache.kafka.connect.storage.StringConverter
#internal.key.converter=org.apache.kafka.connect.storage.StringConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
internal.key.converter.schemas.enable=true
internal.value.converter.schemas.enable=true
offset.storage.file.filename=offset.log
schemas.enable=false
#Rest API
rest.port=8096
plugin.path=/home/opc/kinesis-kafka-connector-master/target/
#rest.host.name=
My kinesis-firehose-kafka-connector.properties is shown below:
name=kafka_kinesis_sink_connector
connector.class=com.amazon.kinesis.kafka.FirehoseSinkConnector
tasks.max=1
topics=OGGTest
region=eu-central-1
batch=true
batchSize=500
batchSizeInBytes=1024
deliveryStream=kafka-s3-stream
The error output is shown below:
[2019-01-26 11:32:24,446] INFO Kafka version : 2.0.0-cdh6.1.0 (org.apache.kafka.common.utils.AppInfoParser:109)
[2019-01-26 11:32:24,446] INFO Kafka commitId : unknown (org.apache.kafka.common.utils.AppInfoParser:110)
[2019-01-26 11:32:24,449] INFO Created connector kafka_kinesis_sink_connector (org.apache.kafka.connect.cli.ConnectStandalone:104)
[2019-01-26 11:32:25,296] ERROR WorkerSinkTask{id=kafka_kinesis_sink_connector-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:177)
com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:131)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1164)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:762)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:724)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClient.doInvoke(AmazonKinesisFirehoseClient.java:826)
at com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClient.invoke(AmazonKinesisFirehoseClient.java:802)
at com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClient.describeDeliveryStream(AmazonKinesisFirehoseClient.java:451)
at com.amazon.kinesis.kafka.FirehoseSinkTask.validateDeliveryStream(FirehoseSinkTask.java:95)
at com.amazon.kinesis.kafka.FirehoseSinkTask.start(FirehoseSinkTask.java:77)
at org.apache.kafka.connect.runtime.WorkerSinkTask.initializeAndStart(WorkerSinkTask.java:301)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:190)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2019-01-26 11:32:25,299] ERROR WorkerSinkTask{id=kafka_kinesis_sink_connector-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:178)
[2019-01-26 11:32:33,375] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:65)
[2019-01-26 11:32:33,375] INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer:223)
Please advise. Thanks in advance!
Put the credentials in the ~/.aws/credentials file located in the home directory of the operating system user that runs the Connect worker processes. These credentials are recognized by most AWS SDKs and the AWS CLI. Use the following AWS CLI command to create the credentials file:
aws configure
You can also manually create the credentials file using a text editor. The file should contain lines in the following format:
[default]
aws_access_key_id =
aws_secret_access_key =
NOTE: When creating the credentials file, make sure that the user creating it is the same user that runs the Connect worker processes and that the file is in that user's home directory. Otherwise, the connector will not be able to find the credentials.
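One way to sanity-check this, assuming a standalone worker launched via ConnectStandalone (adjust the process name to however you start yours):
# find the OS user running the Connect worker
ps -o user= -p "$(pgrep -f ConnectStandalone)"
# then, as that user, confirm the credentials file exists in their home directory
ls -l ~/.aws/credentials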
I am trying to run a Spark job on an EMR cluster.
In my spark-submit I have added configs to read from log4j.properties:
--files log4j.properties --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/log4j.properties"
I have also added
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/log/test.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %5p %c{7} - %m%n
in my log4j configuration.
Anyhow, I see the logs in the console, but I don't see the log file being generated. What am I doing wrong here?
Quoting spark-submit --help:
--files FILES Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName).
That doesn't say much about what to do with the FILES if you cannot use SparkFiles.get(fileName) (which you cannot for log4j).
Quoting SparkFiles.get's scaladoc:
Get the absolute path of a file added through SparkContext.addFile().
That does not give you much either, but it suggests having a look at the source code of SparkFiles.get:
def get(filename: String): String =
new File(getRootDirectory(), filename).getAbsolutePath()
The nice thing about it is that getRootDirectory() uses an optional property or just the current working directory:
def getRootDirectory(): String =
SparkEnv.get.driverTmpDir.getOrElse(".")
That gives us something to work on, doesn't it?
On the driver, the so-called driverTmpDir directory should be easy to find in the Environment tab of the web UI (under Spark Properties for the spark.files property, or among Classpath Entries marked as "Added By User" in the Source column).
On executors, I'd assume a local directory so rather than using file:/log4j.properties I'd use
-Dlog4j.configuration=file://./log4j.properties
or
-Dlog4j.configuration=file:log4j.properties
Note the dot to specify the local working directory (in the first option) or no leading / (in the latter).
Don't forget about spark.driver.extraJavaOptions to set the Java options for the driver if that's something you haven't thought about yet. You've been focusing on executors only so far.
You may want to add -Dlog4j.debug=true to spark.executor.extraJavaOptions that is supposed to print what locations log4j uses to find log4j.properties.
I have not checked this on an EMR or YARN cluster myself, but I believe it may have given you some hints on where to find the answer. Fingers crossed!
With a Spark 2.2.0 standalone cluster, the executor JVM is started first, and only then does Spark distribute the application jar and --files.
Which means passing
spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j-spark.xml
does not make sense, as this file does not exist yet (it has not been downloaded) at the time of executor JVM launch and log4j initialization.
If you pass
spark.executor.extraJavaOptions=-Dlog4j.debug -Dlog4j.configuration=file:log4j-spark.xml
you will find at the beginning of the executor's stderr a failed attempt to load the log4j config file:
log4j:ERROR Could not parse url [file:log4j-spark.xml].
java.io.FileNotFoundException: log4j-spark.xml (No such file or directory)
...
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
And a bit later, the download of --files from the driver is logged:
18/07/18 17:24:12 INFO Utils: Fetching spark://123.123.123.123:35171/files/log4j-spark.xml to /ca/tmp-spark/spark-49815375-3f02-456a-94cd-8099a0add073/executor-7df1c819-ffb7-4ef9-b473-4a2f7747237a/spark-0b50a7b9-ca68-4abc-a05f-59df471f2d16/fetchFileTemp5898748096473125894.tmp
18/07/18 17:24:12 INFO Utils: Copying /ca/tmp-spark/spark-49815375-3f02-456a-94cd-8099a0add073/executor-7df1c819-ffb7-4ef9-b473-4a2f7747237a/spark-0b50a7b9-ca68-4abc-a05f-59df471f2d16/-18631083971531927447443_cache to /opt/spark-2.2.0-bin-hadoop2.7/work/app-20180718172407-0225/2/./log4j-spark.xml
It may work differently with YARN or another cluster manager, but with a standalone cluster it seems there is no way to specify your own logging configuration for executors via spark-submit.
You can dynamically reconfigure log4j in your job code (overriding the log4j configuration programmatically to set the file location for the FileAppender), but you would need to do it carefully in some mapPartitions lambda that is executed in the executor's JVM. Or maybe you can dedicate the first stage of your job to it. All that sucks though... A rough sketch follows.
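For illustration, a minimal sketch of that programmatic route using the log4j 1.x API, with the appender settings taken from the question (run it once per executor JVM, e.g. guarded by a static flag inside a mapPartitions lambda):
import java.io.IOException;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;
import org.apache.log4j.RollingFileAppender;

public static void reconfigureLog4j() throws IOException {
    Logger root = Logger.getRootLogger();
    root.setLevel(Level.INFO);
    // same pattern and file as the log4j.properties in the question
    RollingFileAppender file = new RollingFileAppender(
        new PatternLayout("%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %5p %c{7} - %m%n"),
        "/log/test.log");
    file.setMaxFileSize("10MB");
    file.setMaxBackupIndex(10);
    root.addAppender(file);
}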
Here is the complete command I used to run my uber-jar on EMR, and I see log files generated on both driver and executor nodes.
spark-submit --class com.myapp.cloud.app.UPApp --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 8 --files log4j.properties --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties -Dlog4j.debug=true" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" --conf "spark.eventLog.dir=/mnt/var/log/" uber-up-0.0.1.jar
where log4j.properties is in my local filesystem.
I have just started using Brooklyn and I am trying to get the example from the deploying blueprints page working fully through my AWS account.
The Maven build completed successfully and I can successfully launch the Brooklyn Web UI from ~/apache-brooklyn-0.7.0-M2-incubating/usage/dist/target/brooklyn-dist using the steps on the running Brooklyn page.
When I launch the blueprint, I can see all the VMs launching in my AWS Console. I can also see the key pairs and security groups being created. But the blueprint eventually fails because (I believe) Brooklyn cannot ssh into the VMs; see the first log output below. I assume Brooklyn attempts to log in to the VMs using the created key pairs somehow?
Based on the info in the locations page, I created a ~/.brooklyn/brooklyn.properties file and added the following configuration:
brooklyn.location.jclouds.aws-ec2.identity = MyAwsAccessKeyID
brooklyn.location.jclouds.aws-ec2.credential = MyAwsSecretAccessKey
brooklyn.location.jclouds.aws-ec2.privateKeyFile = /home/username/key4brooklyn.pem
I created the key4brooklyn.pem file from the AWS Console and restarted Brooklyn; however, the blueprint still does not work. It creates the VMs but cannot access them; see the log output below.
2015-03-02 23:31:27,295 INFO Starting MySqlNodeImpl{id=lzJhHxwD}, obtaining a new location instance in JcloudsLocation[aws-ec2:MyAwsAccessKeyID/aws-ec2] with ports [22, 3306]
2015-03-02 23:31:27,369 INFO Starting NginxControllerImpl{id=QYRLgQPh}, obtaining a new location instance in JcloudsLocation[aws-ec2:MyAwsAccessKeyID/aws-ec2] with ports [22, 8000]
2015-03-02 23:31:27,612 INFO Resize DynamicWebAppClusterImpl{id=iJNs2ltC} from 0 to 1
2015-03-02 23:31:28,830 INFO Starting JBoss7ServerImpl{id=MWMGwHXx}, obtaining a new location instance in JcloudsLocation[aws-ec2:MyAwsAccessKeyID/aws-ec2] with ports [22, 9443, 10999, 8443, 8080, 9990]
2015-03-02 23:31:37,870 INFO Creating VM aws-ec2#MySqlNodeImpl{id=lzJhHxwD} in JcloudsLocation[aws-ec2:MyAwsAccessKeyID/aws-ec2]
2015-03-02 23:31:38,508 INFO Creating VM aws-ec2#JBoss7ServerImpl{id=MWMGwHXx} in JcloudsLocation[aws-ec2:MyAwsAccessKeyID/aws-ec2]
2015-03-02 23:31:38,983 INFO Creating VM aws-ec2#NginxControllerImpl{id=QYRLgQPh} in JcloudsLocation[aws-ec2:MyAwsAccessKeyID/aws-ec2]
2015-03-02 23:34:55,349 INFO Not able to load publicKeyData from inferred files, based on privateKeyFile: tried [/home/username/key4brooklyn.pem.pub] for aws-ec2#MySqlNodeImpl {id=lzJhHxwD}
2015-03-02 23:34:55,353 INFO Not able to load publicKeyData from inferred files, based on privateKeyFile: tried [/home/username/key4brooklyn.pem.pub] for aws-ec2#JBoss7ServerImpl {id=MWMGwHXx}
2015-03-02 23:34:55,351 INFO Not able to load publicKeyData from inferred files, based on privateKeyFile: tried [/home/username/key4brooklyn.pem.pub] for aws-ec2#NginxControllerImpl {id=QYRLgQPh}
I am using Ubuntu 14.04 with Oracle Java 7 installed, it is a VirtualBox VM.
Looking at the log output, the problem is here:
2015-03-02 23:34:55,349 INFO Not able to load publicKeyData from inferred files, based on privateKeyFile: tried [/home/username/key4brooklyn.pem.pub] for aws-ec2#MySqlNodeImpl {id=lzJhHxwD}
The privateKeyFile configuration key needs to point to an id_rsa- or id_dsa-style key pair stored in two files. The corresponding *.pub file will be auto-detected if publicKeyFile is not configured. There are better instructions for creating an ssh key available. This is confusing, and better error reporting around keys (including fail-fast) is in the latest SNAPSHOT builds and will be included in the M3 milestone release. Also note that the id_rsa file must contain one and only one private key and must not contain the public key. Tedious that there are so many formats!
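For example, a key pair in the expected two-file format can be generated with (an empty passphrase is assumed here for simplicity):
# creates ~/.ssh/id_rsa (private key only) and ~/.ssh/id_rsa.pub
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa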
The ~/.ssh/id_rsa or other configured key-pair is just used by Brooklyn for setting up ssh access to the VM after it is provisioned. By default, jclouds (which we use under the covers) will create a temporary AWS key-pair to get initial access to the VM. We'll then automatically add the ~/.ssh/id_rsa.pub to the VM's ~/.ssh/authorized_keys (creating a user on the VM that by default has the same name as the user who is running the Brooklyn process).
The key4brooklyn.pem file you downloaded is the private part of the AWS key-pair. By default, this will not be used because jclouds will create its own key-pair.
If you wanted jclouds to use your pre-existing key pair then you'd have to use the following configuration setting:
brooklyn.location.jclouds.aws-ec2.keyPair = MyKeypairName
Where MyKeypairName is the name of the key-pair according to AWS.