Write to a dynamic partition with Java-Spark

I've created the following table in Hive:
CREATE TABLE mytable (..columns...) PARTITIONED BY (load_date string) STORED AS ...
I'm trying to insert data into this table with Spark as follows:
Dataset<Row> dfSelect = df.withColumn("load_date", functions.lit("15_07_2018"));
dfSelect.write().mode("append").partitionBy("load_date").save(path);
I also set the following configuration:
sqlContext().setConf("hive.exec.dynamic.partition","true");
sqlContext().setConf("hive.exec.dynamic.partition.mode","nonstrict");
After the write command I can see the directory /myDbPath/load_date=15_07_2018 on HDFS, which contains the file that I've written, but when I run a query like:
show partitions mytable
or
select * from mytable where load_date="15_07_2018"
I get 0 records.
Why is this happening and how can I fix it?
EDIT
If I run the following command in Hue:
msck repair table mytable
the problem is solved. How can I do this from my code?

Hive stores a list of partitions for each table in its metastore. If new partitions are added directly to HDFS (say by using the hadoop fs -put command, or a Spark .save(), etc.), the metastore (and hence Hive) will not be aware of these partitions unless you run one of the following commands:
The metastore check command (MSCK REPAIR TABLE):
msck repair table <db.name>.<table_name>;
(or)
ALTER TABLE table_name ADD PARTITION commands on each of the newly added partitions.
You can also add partitions with an ALTER TABLE statement; with this approach you have to add each and every newly created partition to the table:
alter table <db.name>.<table_name> add partition(load_date="15_07_2018") location <hdfs-location>;
Run either of the above statements and then check the data again for load_date="15_07_2018".
For more details, refer to the Hive documentation on adding partitions and MSCK REPAIR TABLE.
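For the "how can I do it in my code" part, here is a minimal sketch of triggering the repair from the Spark job itself, assuming a Hive-enabled SparkSession (with the older API the same statements can be run through sqlContext().sql(...)); the table and path names are taken from the question, the app name is illustrative:
import org.apache.spark.sql.SparkSession;

// Hive-enabled session (illustrative app name).
SparkSession spark = SparkSession.builder()
        .appName("load-partitions")
        .enableHiveSupport()
        .getOrCreate();

// After dfSelect.write()...save(path) has created the new directory on HDFS,
// make the metastore aware of it:
spark.sql("MSCK REPAIR TABLE mytable");

// Or register just the one known partition explicitly:
spark.sql("ALTER TABLE mytable ADD IF NOT EXISTS PARTITION (load_date='15_07_2018') "
        + "LOCATION '/myDbPath/load_date=15_07_2018'");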

Related

Out of memory when launching liquibase command "dropAllForeignKey"

I'm running the liquibase command "dropAllForeignKey" on a Sybase database with more than 12,000 tables and more than 380,000 columns. I'm getting an out of memory exception since the Liquibase code tries to query all the columns in the database.
The JVM is launched with: -Xms64M -Xmx512M (if I increase it to 5 GB it works, but I don't see why we have to query all the columns in the database).
The script I'm using:
<dropAllForeignKeyConstraints baseTableName="Table_Name"/>
When I checked the Liquibase code I found the following:
In DropAllForeignKeyConstraintsChange: we create a snapshot for the table mentioned in the XML:
Table target = SnapshotGeneratorFactory.getInstance().createSnapshot(
        new Table(catalogAndSchema.getCatalogName(), catalogAndSchema.getSchemaName(),
                database.correctObjectName(getBaseTableName(), Table.class)),
        database);
In JdbcDatabaseSnapshot: when getColumns is called, bulkFetchQuery() is used instead of fastFetchQuery() because the table is neither "DatabaseChangeLogTableName" nor "DatabaseChangeLogLockTableName". In this case bulkFetchQuery() does not filter on the table given in the dropAllForeignKey XML; instead it uses SQL_FILTER_MATCH_ALL, so it retrieves all the columns in the database (querying all the columns already takes time).
In ColumnMapRowMapper: for each table, a LinkedHashMap is created with a size equal to the number of columns, and this is where I get the OutOfMemoryError.
Is it normal to query all the columns when dropping all the foreign keys for a given table? If so, why is that needed, and is there a solution to my problem that doesn't involve increasing the JVM heap size?
PS: There is another command called dropForeignKey to drop a single foreign key, but it needs the name of the foreign key as input and I don't have it. I can find the foreign key name for a given database, but I'm running this command on different databases, the foreign key name changes from one to another, and I need a generic Liquibase change. So I can't use dropForeignKey and I have to use dropAllForeignKey.
Here is the stack trace:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
at java.base/java.util.HashMap.putVal(HashMap.java:637)
at java.base/java.util.HashMap.put(HashMap.java:607)
at liquibase.executor.jvm.ColumnMapRowMapper.mapRow(ColumnMapRowMapper.java:35)
at liquibase.executor.jvm.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:72)
at liquibase.snapshot.ResultSetCache$ResultSetExtractor.extract(ResultSetCache.java:297)
at liquibase.snapshot.JdbcDatabaseSnapshot$CachingDatabaseMetaData$3.extract(JdbcDatabaseSnapshot.java:774)
at liquibase.snapshot.ResultSetCache$ResultSetExtractor.extract(ResultSetCache.java:288)
at liquibase.snapshot.JdbcDatabaseSnapshot$CachingDatabaseMetaData$3.bulkFetchQuery(JdbcDatabaseSnapshot.java:606)
at liquibase.snapshot.ResultSetCache$SingleResultSetExtractor.bulkFetch(ResultSetCache.java:353)
at liquibase.snapshot.ResultSetCache.get(ResultSetCache.java:59)
at liquibase.snapshot.JdbcDatabaseSnapshot$CachingDatabaseMetaData.getColumns(JdbcDatabaseSnapshot.java:539)
at liquibase.snapshot.jvm.ColumnSnapshotGenerator.addTo(ColumnSnapshotGenerator.java:106)
at liquibase.snapshot.jvm.JdbcSnapshotGenerator.snapshot(JdbcSnapshotGenerator.java:79)
at liquibase.snapshot.SnapshotGeneratorChain.snapshot(SnapshotGeneratorChain.java:49)
at liquibase.snapshot.DatabaseSnapshot.include(DatabaseSnapshot.java:286)
at liquibase.snapshot.DatabaseSnapshot.init(DatabaseSnapshot.java:102)
at liquibase.snapshot.DatabaseSnapshot.<init>(DatabaseSnapshot.java:59)
at liquibase.snapshot.JdbcDatabaseSnapshot.<init>(JdbcDatabaseSnapshot.java:38)
at liquibase.snapshot.SnapshotGeneratorFactory.createSnapshot(SnapshotGeneratorFactory.java:217)
at liquibase.snapshot.SnapshotGeneratorFactory.createSnapshot(SnapshotGeneratorFactory.java:246)
at liquibase.snapshot.SnapshotGeneratorFactory.createSnapshot(SnapshotGeneratorFactory.java:230)
at liquibase.change.core.DropAllForeignKeyConstraintsChange.generateChildren(DropAllForeignKeyConstraintsChange.java:90)
at liquibase.change.core.DropAllForeignKeyConstraintsChange.generateStatements(DropAllForeignKeyConstraintsChange.java:59)

Using the Spark Cassandra Java Connector to append to a table

javaFunctions(recomm)
        .writerBuilder("recommender", "recommendations_wkg", mapToRow(Recommendations_wkg.class))
        .saveToCassandra();
The code inserts into the table but won't update it. If the column already exists, it inserts a new one. I want it to update when I have different info.

Can't get an existing table from embedded HSQLDB

To use an embedded database I created an HSQLDB database and connected to it from Java following this tutorial.
In short, it is about creating a simple table with 3 records through the HSQLDB manager and then connecting to this database with Java.
The database was created, and after exiting the manager and connecting again I still had my tables. They are recorded in the test.script file.
If I try to connect from Java with
connection = DriverManager.getConnection("jdbc:hsqldb:file:/db/test;ifexists=true", "SA", "");
then I get a connection and can read all the tables from it with
resultSet = statement.executeQuery("SELECT TABLE_NAME, COLUMN_NAME, TYPE_NAME, COLUMN_SIZE, DECIMAL_DIGITS, IS_NULLABLE FROM INFORMATION_SCHEMA.SYSTEM_COLUMNS WHERE TABLE_NAME NOT LIKE 'SYSTEM_%'");
But I can't get the table previously created in the manager, even though the script file contains it.
This CREATE TABLE snippet is in the test.script file:
CREATE MEMORY TABLE PUBLIC.SALARYDETAILS(EMPID VARCHAR(6) PRIMARY KEY,SALARY INTEGER NOT NULL,BONUS INTEGER NOT NULL,INCREMENT INTEGER NOT NULL)
I was confused by the MEMORY keyword, but maybe that is not what I need to fix; after removing it, the manager adds it back again.
----- update 1 -----
SHUTDOWN command didn't help.
I don't know how HSQLDB stores data, but I thought that once they are in the script file, it is done.
The exception I see on the command line when running the Java code is
user lacks privilege or object not found
----- update 2 -----
I created a table in Java and recorded data into it, and I was able to read that data back, but I can't see it in the manager. It looks to me like a different database, but the location is the same.
The script file doesn't contain the new table and data. I don't know where the data is stored.
It seems the database file was not saved. Before you exit the manager, execute this SQL command:
SHUTDOWN
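If the same flush is needed from the Java side (see update 2), the SHUTDOWN statement can also be issued over JDBC. A minimal sketch, reusing the connection URL from the question:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqldbShutdownExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = DriverManager.getConnection(
                "jdbc:hsqldb:file:/db/test;ifexists=true", "SA", "");
             Statement statement = connection.createStatement()) {

            // ... create tables / insert data here ...

            // Write all pending changes to the test.script file and close the
            // database cleanly, so other tools (e.g. the manager) see them.
            statement.execute("SHUTDOWN");
        }
    }
}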

JDBC - PostgreSQL - batch insert + unique index

I have a table with a unique constraint on some field. I need to insert a large number of records into this table. To make it faster I'm using batch updates with JDBC (driver version is 8.3-603).
Is there a way to do the following:
on every batch execution, write into the table all records from the batch that don't violate the unique index;
on every batch execution, get back the records from the batch that were not inserted into the DB, so I can save the "wrong" records?
The most efficient way of doing this would be something like this:
create a staging table with the same structure as the target table but without the unique constraint
batch insert all rows into that staging table. The most efficient way is to use COPY or the CopyManager (although I don't know if that is already supported in your ancient driver version).
Once that is done you copy the valid rows into the target table:
insert into target_table(id, col_1, col_2)
select id, col_1, col_2
from staging_table
where not exists (select *
                  from target_table
                  where target_table.id = staging_table.id);
Note that the above is not concurrency safe! If other processes do the same thing you might still get unique key violations. To prevent that you need to lock the target table.
If you want to remove the copied rows, you could do that using a writeable CTE:
with inserted as (
  insert into target_table(id, col_1, col_2)
  select id, col_1, col_2
  from staging_table
  where not exists (select *
                    from target_table
                    where target_table.id = staging_table.id)
  returning id
)
delete from staging_table
where id in (select id from inserted);
A (non-unique) index on staging_table.id should help the performance.
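A rough Java sketch of the overall flow under the assumptions above (target_table(id, col_1, col_2), a staging_table with the same columns, and rows passed as Object[] arrays are all illustrative):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class StagingLoader {

    /** Each row is {id, col_1, col_2}; the layout is illustrative only. */
    public static void load(Connection conn, List<Object[]> rows) throws SQLException {
        conn.setAutoCommit(false);

        // 1. Batch insert everything into the unconstrained staging table.
        try (PreparedStatement ps = conn.prepareStatement(
                "insert into staging_table(id, col_1, col_2) values (?, ?, ?)")) {
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.setObject(3, row[2]);
                ps.addBatch();
            }
            ps.executeBatch();
        }

        // 2. Move only the rows that do not violate the unique constraint and
        //    delete them from the staging table in the same statement.
        //    (If concurrent loaders are possible, lock target_table first,
        //    as noted above.)
        try (Statement st = conn.createStatement()) {
            st.execute(
                "with inserted as ("
              + "  insert into target_table(id, col_1, col_2)"
              + "  select id, col_1, col_2 from staging_table"
              + "  where not exists (select * from target_table"
              + "                    where target_table.id = staging_table.id)"
              + "  returning id"
              + ") "
              + "delete from staging_table where id in (select id from inserted)");
        }

        conn.commit();
        // Whatever is still left in staging_table are the "wrong" records.
    }
}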

produce hfiles for multiple tables to bulk load in a single map reduce

I am using MapReduce and HFileOutputFormat to produce HFiles and bulk load them directly into an HBase table.
Now, while reading the input files, I want to produce HFiles for two tables and bulk load both outputs in a single MapReduce job.
I searched the web and saw some links about MultiHFileOutputFormat, but couldn't find a real solution.
Do you think that it is possible?
My way is:
keep using HFileOutputFormat; when the job is completed, doBulkLoad writes into table1.
Keep a List<Put> puts in the mapper, and a global MAX_PUTS value.
When puts.size() > MAX_PUTS, do:
String tableName = conf.get("hbase.table.name.dic", table2);
HTable table = new HTable(conf, tableName);
table.setAutoFlushTo(false);
table.setWriteBufferSize(1024*1024*64);
table.put(puts);
table.close();
puts.clear();
Note: you must have a cleanup() function to write the remaining puts, as sketched below.
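For illustration only, a rough skeleton of such a mapper, using the same old HTable API as the snippet above; the class name, input key/value types, and MAX_PUTS value are assumptions:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// HFile output for table1 goes through the normal context.write() path,
// while rows for table2 are buffered as Puts and written directly whenever
// the buffer is full, and once more in cleanup().
public class DualTableMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

    private static final int MAX_PUTS = 10000;            // illustrative threshold
    private final List<Put> puts = new ArrayList<Put>();

    private void flushPuts(Configuration conf) throws IOException {
        if (puts.isEmpty()) {
            return;
        }
        String tableName = conf.get("hbase.table.name.dic", "table2");
        HTable table = new HTable(conf, tableName);
        table.setAutoFlushTo(false);
        table.setWriteBufferSize(1024 * 1024 * 64);
        table.put(puts);
        table.flushCommits();
        table.close();
        puts.clear();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... emit KeyValues for table1 via context.write(...),
        // ... and add Puts for table2 to the puts list ...
        if (puts.size() > MAX_PUTS) {
            flushPuts(context.getConfiguration());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Write whatever Puts are still buffered when the mapper finishes.
        flushPuts(context.getConfiguration());
    }
}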
