Change output filename prefix for DataFrame.write() - java

Output files generated via the Spark SQL DataFrame.write() method begin with the "part" basename prefix, e.g.:
DataFrame sample_07 = hiveContext.table("sample_07");
sample_07.write().parquet("sample_07_parquet");
Results in:
hdfs dfs -ls sample_07_parquet/
Found 4 items
-rw-r--r-- 1 rob rob 0 2016-03-19 16:40 sample_07_parquet/_SUCCESS
-rw-r--r-- 1 rob rob 491 2016-03-19 16:40 sample_07_parquet/_common_metadata
-rw-r--r-- 1 rob rob 1025 2016-03-19 16:40 sample_07_parquet/_metadata
-rw-r--r-- 1 rob rob 17194 2016-03-19 16:40 sample_07_parquet/part-r-00000-cefb2ac6-9f44-4ce4-93d9-8e7de3f2cb92.gz.parquet
I would like to change the output filename prefix used when creating a file with Spark SQL DataFrame.write(). I tried setting the "mapreduce.output.basename" property on the Hadoop configuration for the Spark context, e.g.:
public class MyJavaSparkSQL {
    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("MyJavaSparkSQL");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        ctx.hadoopConfiguration().set("mapreduce.output.basename", "myprefix");

        HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(ctx.sc());
        DataFrame sample_07 = hiveContext.table("sample_07");
        sample_07.write().parquet("sample_07_parquet");

        ctx.stop();
    }
}
That did not change the output filename prefix for the generated files.
Is there a way to override the output filename prefix when using the DataFrame.write() method?

You cannot change the "part" prefix while using any of the standard output formats (like Parquet). See this snippet from the ParquetRelation source code:
private val recordWriter: RecordWriter[Void, InternalRow] = {
  val outputFormat = {
    new ParquetOutputFormat[InternalRow]() {
      // ...
      override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
        // ..
        // prefix is hard-coded here:
        new Path(path, f"part-r-$split%05d-$uniqueWriteJobId$bucketString$extension")
      }
    }
  }
If you really must control the part file names, you'll probably have to implement a custom FileOutputFormat and use one of Spark's save methods that accept a FileOutputFormat class (e.g. saveAsHadoopFile).
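As a hedged sketch of that route (this is not the DataFrame.write() path, and the class name, prefix, and key/value types below are only illustrative assumptions): with the RDD-based saveAsNewAPIHadoopFile you can subclass a Hadoop FileOutputFormat such as TextOutputFormat and override getDefaultWorkFile, which is where the part-file name is built.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical output format whose files start with "myprefix" instead of "part".
public class PrefixedTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException {
        FileOutputCommitter committer = (FileOutputCommitter) getOutputCommitter(context);
        // getUniqueFile appends the task/partition id and the extension to the given basename
        return new Path(committer.getWorkPath(), getUniqueFile(context, "myprefix", extension));
    }
}

A pair RDD could then be written with something like javaPairRdd.saveAsNewAPIHadoopFile(outputPath, Text.class, Text.class, PrefixedTextOutputFormat.class, ctx.hadoopConfiguration()). For Parquet written through DataFrame.write() there is no such hook, so renaming the output files after the write (as the answers below do) is usually the simpler route.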

Assuming that the output folder has only one CSV file in it, we can rename it programmatically (or dynamically) using the code below. The last line gets all files of CSV type from the output directory and renames them to the desired file name.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
val outputfolder_Path = "s3://<s3_AccessKey>:<s3_Securitykey>@<external_bucket>/<path>"
val fs = FileSystem.get(new java.net.URI(outputfolder_Path), new Configuration())
fs.globStatus(new Path(outputfolder_Path + "/*.*"))
  .filter(_.getPath.toString.split("/").last.split("\\.").last == "csv")
  .foreach { l =>
    fs.rename(new Path(l.getPath.toString), new Path(outputfolder_Path + "/DesiredFilename.csv"))
  }

Agree with @Tzach Zohar..
After saving your DataFrame to HDFS or S3 you can rename the files using the def below. The Scala example is ready to use, so you can drop it directly into your code or a util.
Brief:
1) Get all the files under the folder using globStatus.
2) Loop through and rename each file with a prefix or suffix, whatever your case requires.
Note: Apache Commons is already available on Hadoop clusters, so no further dependencies are needed.
/**
 * prefixHdfsFiles
 * @param outputfolder_Path
 * @param prefix
 */
def prefixHdfsFiles(outputfolder_Path: String, prefix: String) = {
  import org.apache.hadoop.fs._
  import org.apache.hadoop.conf.Configuration
  import org.apache.commons.io.FilenameUtils._
  import java.io.File
  import java.net.URI

  val fs = FileSystem.get(new URI(outputfolder_Path), new Configuration())
  fs.globStatus(new Path(outputfolder_Path + "/*.*")).foreach { l: FileStatus =>
    val newhdfsfileName = new Path(
      getFullPathNoEndSeparator(l.getPath.toString) + File.separatorChar + prefix + getName(l.getPath.toString))
    // uncomment to actually perform the rename:
    // fs.rename(new Path(l.getPath.toString), newhdfsfileName)
    val change =
      s"""
         |original ${new Path(l.getPath.toString)} --> new $newhdfsfileName
         |""".stripMargin
    println(change)
  }
}
An example caller would be:
val outputfolder_Path = "/a/b/c/d/e/f/"
prefixHdfsFiles(outputfolder_Path, "myprefix_")

Related

Calculate folder size or file size in Jenkins pipeline

We write our Jenkins pipeline using a Groovy script. Is there any way to determine the folder size or file size?
Our goal is to get the sizes of two zip files and calculate the difference between them.
I tried the code below but it's not working.
stage('Calculate Opatch size')
{
    def sampleDir = new File('${BuildPathPublishRoot}')
    def sampleDirSize = sampleDir.directorySize()
    echo sampleDirSize
}
I am getting the error below:
hudson.remoting.ProxyException: groovy.lang.MissingMethodException: No signature of method: java.io.File.directorySize() is applicable for argument types: () values: []
Possible solutions: directorySize()
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:154)
Here's what worked for me. Grab all the files in a directory and sum the lengths.
Please note that you'll need to use double quotes (") for string interpolation to work, i.e. "${BuildPathPublishRoot}" places the value of the BuildPathPublishRoot variable into the string, whereas '${BuildPathPublishRoot}' is taken literally as the directory name.
workspaceSize = directorySize("${BuildPathPublishRoot}")

/** Computes bytes in the directory */
public def directorySize(directory) {
    long bytes = 0
    directory = (directory ?: '').replace('\\', '/')
    directory = (directory == '') ? '' : (directory.endsWith('/') ? directory : "${directory}/")
    def files = findFiles(glob: "${directory}*.*")
    for (file in files) {
        if (!file.isDirectory()) {
            bytes += file.length
        }
    }
    return bytes
}

How to link classes from JDK into scaladoc-generated doc?

I'm trying to link classes from the JDK into the scaladoc-generated doc.
I've used the -doc-external-doc option of scaladoc 2.10.1 but without success.
I'm using -doc-external-doc:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar#http://docs.oracle.com/javase/7/docs/api/, but I get links such as index.html#java.io.File instead of index.html?java/io/File.html.
Seems like this option only works for scaladoc-generated doc.
Did I miss an option in scaladoc or should I file a feature request?
I've configured sbt as follows:
scalacOptions in (Compile,doc) += "-doc-external-doc:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar#http://docs.oracle.com/javase/7/docs/api"
Note: I've seen the Opts.doc.externalAPI util in the upcoming sbt 0.13. I think a nice addition (not sure if it's possible) would be to pass a ModuleID instead of a File. The util would figure out which file corresponds to the ModuleID.
I use sbt 0.13.5.
There's no out-of-the-box way to get Javadoc links inside scaladoc-generated docs. As far as I understand, it's not sbt's fault but the way scaladoc works. As Josh pointed out in his comment, you should report it to scaladoc.
There is, however, a workaround I came up with: post-process the scaladoc generated by the doc task so the Java URLs get replaced to form proper Javadoc links.
The file scaladoc.sbt below should be placed inside an sbt project, and whenever the doc task gets executed, the post-processing via the fixJavaLinksTask task kicks in.
NOTE: there are lots of hardcoded paths, so use it with caution (i.e. do the polishing however you see fit).
import scala.util.matching.Regex.Match

autoAPIMappings := true

// builds -doc-external-doc
apiMappings += (
  file("/Library/Java/JavaVirtualMachines/jdk1.8.0_11.jdk/Contents/Home/jre/lib/rt.jar") ->
    url("http://docs.oracle.com/javase/8/docs/api")
)

lazy val fixJavaLinksTask = taskKey[Unit](
  "Fix Java links - replace #java.io.File with ?java/io/File.html"
)

fixJavaLinksTask := {
  println("Fixing Java links")
  val t = (target in (Compile, doc)).value
  (t ** "*.html").get.filter(hasJavadocApiLink).foreach { f =>
    println("fixing " + f)
    val newContent = javadocApiLink.replaceAllIn(IO.read(f), fixJavaLinks)
    IO.write(f, newContent)
  }
}

val fixJavaLinks: Match => String = m =>
  m.group(1) + "?" + m.group(2).replace(".", "/") + ".html"

val javadocApiLink = """\"(http://docs\.oracle\.com/javase/8/docs/api/index\.html)#([^"]*)\"""".r

def hasJavadocApiLink(f: File): Boolean = (javadocApiLink findFirstIn IO.read(f)).nonEmpty

fixJavaLinksTask <<= fixJavaLinksTask triggeredBy (doc in Compile)
I took the answer by @Jacek Laskowski and modified it so that it avoids hard-coded strings and can be used for any number of Java libraries, not just the standard one.
Edit: the location of rt.jar is now determined at runtime from sun.boot.class.path and does not have to be hard-coded.
The only thing you need to modify is the map, which I have called externalJavadocMap in the following:
import scala.util.matching.Regex
import scala.util.matching.Regex.Match

val externalJavadocMap = Map(
  "owlapi" -> "http://owlcs.github.io/owlapi/apidocs_4_0_2/index.html"
)

/*
 * The rt.jar file is located in the path stored in the sun.boot.class.path system property.
 * See the Oracle documentation at http://docs.oracle.com/javase/6/docs/technotes/tools/findingclasses.html.
 */
val rtJar: String = System.getProperty("sun.boot.class.path").split(java.io.File.pathSeparator).collectFirst {
  case str: String if str.endsWith(java.io.File.separator + "rt.jar") => str
}.get // fail hard if not found

val javaApiUrl: String = "http://docs.oracle.com/javase/8/docs/api/index.html"

val allExternalJavadocLinks: Seq[String] = javaApiUrl +: externalJavadocMap.values.toSeq

def javadocLinkRegex(javadocURL: String): Regex = ("""\"(\Q""" + javadocURL + """\E)#([^"]*)\"""").r

def hasJavadocLink(f: File): Boolean = allExternalJavadocLinks exists {
  javadocURL: String =>
    (javadocLinkRegex(javadocURL) findFirstIn IO.read(f)).nonEmpty
}

val fixJavaLinks: Match => String = m =>
  m.group(1) + "?" + m.group(2).replace(".", "/") + ".html"

/* You can print the classpath with `show compile:fullClasspath` in the SBT REPL.
 * From that list you can find the name of the jar for the managed dependency.
 */
lazy val documentationSettings = Seq(
  apiMappings ++= {
    // Lookup the path to jar from the classpath
    val classpath = (fullClasspath in Compile).value
    def findJar(nameBeginsWith: String): File = {
      classpath.find { attributed: Attributed[File] =>
        (attributed.data ** s"$nameBeginsWith*.jar").get.nonEmpty
      }.get.data // fail hard if not found
    }
    // Define external documentation paths
    (externalJavadocMap map {
      case (name, javadocURL) => findJar(name) -> url(javadocURL)
    }) + (file(rtJar) -> url(javaApiUrl))
  },
  // Override the task to fix the links to JavaDoc
  doc in Compile <<= (doc in Compile) map { target: File =>
    (target ** "*.html").get.filter(hasJavadocLink).foreach { f =>
      //println(s"Fixing $f.")
      val newContent: String = allExternalJavadocLinks.foldLeft(IO.read(f)) {
        case (oldContent: String, javadocURL: String) =>
          javadocLinkRegex(javadocURL).replaceAllIn(oldContent, fixJavaLinks)
      }
      IO.write(f, newContent)
    }
    target
  }
)
I am using SBT 0.13.8.

GATE ML Information Extraction process fails to produce proper class labels

I am trying to learn machine learning. For Information Extraction, the save files are getting populated properly with data, but the number of classes is 0 in the NLPFeaturesData.save file, and the log looks like this:
93 #numTrainingDocs
0 #numClasses
53738 #numNullLabelInstances
9006940 #totalFeatures
C:\...\learnedModels.save #modelFile
SVMLibSvmJava #learnerName
null #learnerExecutable
-c 0.7 -t 0 -m 100 -tau 0.4 #learnerParams
I have run the following code, but all class labels in the generated NLPFeaturesData.save file are 0. Could someone please tell me where I went wrong?
try {
    // ***************** load Gate & it's plugin [ Load learning] ***********************************
    System.setProperty("gate.home", "C:\\Program Files\\GATE_Developer_7.1");
    Gate.init();
    Gate.getCreoleRegister().registerDirectories(new File(Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR).toURI().toURL());
    Gate.getCreoleRegister().registerDirectories(new URL(FILE_WORK_PATH + "/plugins/Learning"));

    // ****************** Instantiate corpus and load training documents *****************************
    gate.Corpus corpus = (Corpus) Factory.createResource("gate.corpora.CorpusImpl");
    FileFilter fileFilter = new FileFilter() {
        public boolean accept(File pathname) {
            // TODO Auto-generated method stub
            return true;
        }
    };
    corpus.populate(new URL(".../corpus"), fileFilter, "UTF-8", false);
    Gate.getCreoleRegister().registerDirectories();

    // Make a pipeline and add the corpus
    FeatureMap pfm = Factory.newFeatureMap();
    pfm.put("corpus", corpus);
    pipeline = (gate.creole.SerialAnalyserController) gate.Factory.createResource("gate.creole.SerialAnalyserController", pfm);
    initAnnie();

    // ********************************* Configure with relations config file and learning api
    File configFile = new File("../learning-config.xml"); // Wherever it is
    RunMode mode = RunMode.TRAINING; // or TRAINING, or APPLICATION ..
    FeatureMap fm = Factory.newFeatureMap();
    fm.put("configFileURL", configFile.toURI().toURL());
    fm.put("learningMode", mode);
    gate.learning.LearningAPIMain learner = (gate.learning.LearningAPIMain) gate.Factory.createResource("gate.learning.LearningAPIMain", fm);
    pipeline.add(learner);
    pipeline.execute();
} catch (Exception e) {
    e.printStackTrace();
}
}

private static void initAnnie() throws GateException {
    for (int i = 0; i < ANNIEConstants.PR_NAMES.length; i++) {
        FeatureMap params = Factory.newFeatureMap(); // use default parameters
        ProcessingResource pr = (ProcessingResource)
                Factory.createResource(ANNIEConstants.PR_NAMES[i], params);
        pipeline.add(pr);
    }
}
Finally, I have resolved this problem. I added the following AnnotationSet settings to the gate.learning.LearningAPIMain instance:
learner.setInputASName("Key");
learner.setOutputASName("Key");
Now my save files are generated in the proper format.

How to collect a directory listing along with each file's CRC checksum?

I use the following command to get a directory listing on *nix (Linux, AIX, SunOS, HP-UX) platforms:
Command
ls -latr
Output
drwxr-xr-x 2 ricky support 4096 Aug 29 11:59 lib
-rwxrwxrwx 1 ricky support 924 Aug 29 12:00 initservice.sh
The cksum command is used for getting the CRC checksum.
How can the CRC checksum be appended after each file (while keeping directories in the listing too), like below, maintaining this format on these *nix (Linux, AIX, SunOS, HP-UX) platforms?
drwxr-xr-x 2 ricky support 4096 Aug 29 11:59 lib
-rwxrwxrwx 1 ricky support 924 Aug 29 12:00 initservice.sh 4287252281
Update note: no third-party applications; I am using Java/Groovy to parse the output, ultimately into a given format which forms an XML using Groovy's XmlSlurper (the generated XMLs are around 5 MB in size):
"permission","hardlink","owner","group","fsize","month","date","time","filename","checksum"
All Suggestions are welcome! :)
Update with my code
But here I am calculating an MD5 hex digest, which gives similar output to the md5sum command on Linux. So it's no longer cksum, as I cannot use Jacksum because of a licensing issue :(
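The getMD5CheckSum helper called in the class below is not shown; a minimal sketch in plain Java using java.security.MessageDigest (the method name only matches the call site, so treat the implementation as an assumption) could look like this:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public static String getMD5CheckSum(String path) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    InputStream in = new FileInputStream(path);
    try {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read); // feed the digest incrementally
        }
    } finally {
        in.close();
    }
    // render the 16-byte digest as the usual 32-character hex string
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest()) {
        hex.append(String.format("%02x", b & 0xff));
    }
    return hex.toString();
}

And the class itself: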
import groovy.io.FileType

import java.text.Format
import java.text.SimpleDateFormat

class CheckSumCRC32 {

    public def getFileListing(String file) {
        def dir = new File(file)
        def filename = null
        def md5sum = null
        def filesize = null
        def lastmodified = null
        def lastmodifiedDate = null
        def lastmodifiedTime = null
        def permission = null
        Format formatter = null
        def list = []
        if (dir.exists()) {
            dir.eachFileRecurse(FileType.FILES) { fname ->
                list << fname
            }
            list.each { fileob ->
                try {
                    md5sum = getMD5CheckSum(fileob.toString())
                    filesize = fileob.length() + "b"
                    lastmodified = new Date(fileob.lastModified())
                    lastmodifiedDate = lastmodified.format('dd/MM/yyyy')
                    formatter = new SimpleDateFormat("hh:mm:ss a")
                    lastmodifiedTime = formatter.format(lastmodified)
                    permission = getReadPermissions(fileob) + getWritePermissions(fileob) + getExecutePermissions(fileob)
                    filename = getRelativePath("E:\\\\temp\\\\recurssive\\\\", fileob.toString())
                    println "$filename, $md5sum, $lastmodifiedDate, $filesize, $permission, $lastmodifiedDate, $lastmodifiedTime "
                }
                catch (FileNotFoundException fne) {
                    println fne
                }
                catch (IOException io) {
                    println io
                }
                catch (Exception e) {
                    println e
                }
            }
        }
    }

    public def getReadPermissions(def file) {
        String temp = "-"
        if (file.canRead()) temp = "r"
        return temp
    }

    public def getWritePermissions(def file) {
        String temp = "-"
        if (file.canWrite()) temp = "w"
        return temp
    }

    public def getExecutePermissions(def file) {
        String temp = "-"
        if (file.canExecute()) temp = "x"
        return temp
    }

    public def getRelativePath(def main, def file) {
        return file.toString().replaceAll(main, "")
    }

    public static void main(String[] args) {
        CheckSumCRC32 crc = new CheckSumCRC32()
        crc.getFileListing("E:\\temp\\recurssive")
    }
}
Output
release.zip, 25f995583144bebff729086ae6ec0eb2, 04/06/2012, 6301510b, rwx, 04/06/2012, 02:46:32 PM
file\check\release-1.0.zip, 3cc0f2b13778129c0cc41fb2fdc7a85f, 18/07/2012, 11786307b, rwx, 18/07/2012, 04:13:47 PM
file\Dedicated.mp3, 238f793f0b80e7eacf5fac31d23c65d4, 04/05/2010, 4650908b, rwx, 04/05/2010, 10:45:32 AM
But I still need a way to get the hard link count, owner, and group. From what I found on the net it looks like Java 7 has this capability, and I am stuck with Java 6. Any help?
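For reference, the Java 7 capability alluded to above is the NIO.2 file-attribute API; a minimal, hypothetical sketch (it requires Java 7 or later, and the "unix" attribute view is platform-specific) would be:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFileAttributes;

public class FileOwnerInfo {
    public static void main(String[] args) throws Exception {
        Path p = Paths.get("initservice.sh");
        // owner and group via the POSIX attribute view
        PosixFileAttributes attrs = Files.readAttributes(p, PosixFileAttributes.class);
        System.out.println("owner: " + attrs.owner().getName());
        System.out.println("group: " + attrs.group().getName());
        // hard link count via the platform-specific "unix" view
        System.out.println("hard links: " + Files.getAttribute(p, "unix:nlink"));
    }
}

On Java 6 the standard library has no equivalent, so parsing the ls -latr output (as the answers below do for the checksum column) remains the practical fallback.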
Take a look at http://www.jonelo.de/java/jacksum/index.html - it is reported to provide cksum-compatible CRC32 checksums.
BTW, I tried using java.util.zip.CRC32 to calculate checksums, and it gives a different value than cksum does, so it must use a slightly different algorithm.
EDIT: I tried jacksum, and it works, but you have to tell it to use the 'cksum' algorithm - apparently that is different from crc32, which jacksum also supports.
Well, you could run the command, then, for each line, run the cksum and append it to the line.
I did the following:
dir = "/home/will"
"ls -latr $dir".execute().in.eachLine { line ->
    // let's omit the first line, which starts with "total"
    if (line =~ /^total/) return
    // for directories, we just print the line
    if (line =~ /^d/)
    {
        println line
    }
    else
    {
        // for files, we split the line by one or more spaces and join
        // the last pieces to form the filename (there must be a better
        // way to do this)
        def fileName = line.split(/ {1,}/)[8..-1].join("")
        // now we get the first part of the cksum
        def cksum = "cksum $dir/$fileName".execute().in.text.split(/ {1,}/)[0]
        // concat the result to the original line and print it
        println "$line $cksum"
    }
}
Pay special attention to my "there must be a better way to do this".

Batch file renaming – inserting text from a list (in Python or Java)

I'm finishing a business card production flow (Excel > XML > InDesign > single-page PDFs) and I would like to insert the employees' names into the filenames.
What I have now:
BusinessCard_01_Blue.pdf
BusinessCard_02_Blue.pdf
BusinessCard_03_Blue.pdf (they will go up into the hundreds)
What I need (I can manipulate the name list with regex easily):
BusinessCard_01_CarlosJorgeSantos_Blue.pdf
BusinessCard_02_TaniaMartins_Blue.pdf
BusinessCard_03_MarciaLima_Blue.pdf
I'm a Java and Python toddler. I've read the related questions, tried this in Automator (Mac) and Name Mangler, but couldn't get it to work.
Thanks in advance,
Gus
Granted you have a map where you can look up the right name, you could do something like this in Java:
List<File> originalFiles = ...
for (File f : originalFiles) {
    f.renameTo(new File(getNameFor(f)));
}
And define getNameFor to something like:
public String getNameFor(File f) {
    Map<String, String> namesMap = ...
    return namesMap.get(f.getName());
}
In the map you'll have the associations:
BusinessCard_01_Blue.pdf => BusinessCard_01_CarlosJorgeSantos_Blue.pdf
Does it make sense?
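For instance (using the names from the question), the map could simply be built inline:

import java.util.HashMap;
import java.util.Map;

Map<String, String> namesMap = new HashMap<String, String>();
namesMap.put("BusinessCard_01_Blue.pdf", "BusinessCard_01_CarlosJorgeSantos_Blue.pdf");
namesMap.put("BusinessCard_02_Blue.pdf", "BusinessCard_02_TaniaMartins_Blue.pdf");
namesMap.put("BusinessCard_03_Blue.pdf", "BusinessCard_03_MarciaLima_Blue.pdf");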
In Python (tested):
#!/usr/bin/python
import sys, os, shutil, re

try:
    pdfpath = sys.argv[1]
except IndexError:
    pdfpath = os.curdir

employees = {1: 'Bob', 2: 'Joe', 3: 'Sara'}  # emp_id: 'name'
files = [f for f in os.listdir(pdfpath) if re.match("BusinessCard_[0-9]+_Blue.pdf", f)]
idnumbers = [int(re.search("[0-9]+", f).group(0)) for f in files]
filenamemap = zip(files, [employees[i] for i in idnumbers])
newfiles = [re.sub('Blue.pdf', e + '_Blue.pdf', f) for f, e in filenamemap]
for old, new in zip(files, newfiles):
    shutil.move(os.path.join(pdfpath, old), os.path.join(pdfpath, new))
EDIT: This now alters only those files that have not yet been altered.
Let me know if you want something that will build the employees dictionary automatically.
If you have a list of names in the same order the files are produced, in Python it goes like this (untested fragment):
#!/usr/bin/python
import os

f = open('list.txt', 'r')
for n, name in enumerate(f):
    original_name = 'BusinessCard_%02d_Blue.pdf' % (n + 1)
    new_name = 'BusinessCard_%02d_%s_Blue.pdf' % (
        n + 1, ''.join(name.title().split()))
    if os.path.isfile(original_name):
        print "Renaming %s to %s" % (original_name, new_name),
        os.rename(original_name, new_name)
        print "OK!"
    else:
        print "File %s not found." % original_name
Python:
Assuming you have implemented the naming logic already:
import os

for f in os.listdir(<directory>):
    try:
        os.rename(f, new_name(f))
    except OSError:
        pass  # fail
You will, of course, need to write a function new_name which takes the string "BusinessCard_01_Blue.pdf" and returns the string "BusinessCard_01_CarlosJorgeSantos_Blue.pdf".
