public class ClusterByTimestamp extends CopyFilesNaming<ClusterMembership>
The timestamp is chosen, in this order of priority:
Timezones are assumed to be the current time-zone, if not otherwise indicated.
File modification time is not considered.
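The time-zone rule can be illustrated with a small sketch. It assumes the semantics documented for `getTimeZoneOffset()` (a value `>= 0` is a fixed offset in hours, `-1` means the system time-zone). The class `TimeZoneOffsetSketch` and its helpers are hypothetical names for illustration and are not part of this class.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimeZoneOffsetSketch {

    /** Hypothetical helper: maps the integer offset to a zone, following the documented rule. */
    static ZoneId zoneFor(int timeZoneOffset) {
        if (timeZoneOffset >= 0) {
            // A fixed offset in whole hours, e.g. 2 -> +02:00
            return ZoneOffset.ofHours(timeZoneOffset);
        }
        // Otherwise fall back to the current system time-zone settings
        return ZoneId.systemDefault();
    }

    /** Converts a zone-less timestamp to epoch seconds in the selected zone. */
    static long toEpochSeconds(LocalDateTime timestamp, int timeZoneOffset) {
        return timestamp.atZone(zoneFor(timeZoneOffset)).toEpochSecond();
    }

    public static void main(String[] args) {
        LocalDateTime timestamp = LocalDateTime.of(2023, 5, 17, 12, 0, 0);
        // The same wall-clock time maps to different instants depending on the offset
        System.out.println(toEpochSeconds(timestamp, 0));
        System.out.println(toEpochSeconds(timestamp, 2));
        System.out.println(toEpochSeconds(timestamp, -1));
    }
}
```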
The clusters are named 01, 02, 03 etc., depending on the number of clusters.
The DBSCAN algorithm is used for clustering.
A special cluster OUTLIER_CLUSTER_IDENTIFIER may also be created, for points that were not density-reachable by others, and aren't part of any cluster in particular.
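As a rough illustration of how DBSCAN behaves on one-dimensional timestamps, the following is a minimal textbook-style sketch, not this class's actual implementation. Timestamps are represented as hours in a `double[]`, and the names `DbscanTimestampSketch`, `cluster`, `neighbours` and the label `-1` for outliers are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class DbscanTimestampSketch {

    static final int UNVISITED = -2;
    static final int OUTLIER = -1; // assumption: stands in for the special outlier cluster

    /** Indices of all points within eps of points[index], including the point itself. */
    static List<Integer> neighbours(double[] points, int index, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            if (Math.abs(points[i] - points[index]) <= eps) {
                out.add(i);
            }
        }
        return out;
    }

    /** Returns one label per point: 0, 1, 2, ... for clusters, or OUTLIER. */
    static int[] cluster(double[] points, double eps, int minPoints) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, UNVISITED);
        int nextCluster = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) {
                continue;
            }
            List<Integer> seeds = neighbours(points, i, eps);
            if (seeds.size() < minPoints) {
                labels[i] = OUTLIER; // may still be claimed later as a border point
                continue;
            }
            int id = nextCluster++;
            labels[i] = id;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int j = queue.poll();
                if (labels[j] == OUTLIER) {
                    labels[j] = id; // border point: reachable, but not a core point
                }
                if (labels[j] != UNVISITED) {
                    continue;
                }
                labels[j] = id;
                List<Integer> reachable = neighbours(points, j, eps);
                if (reachable.size() >= minPoints) {
                    queue.addAll(reachable); // j is a core point, so keep expanding
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[] hours = {0.0, 0.5, 1.0, 10.0, 10.2, 10.4, 50.0};
        // prints [0, 0, 0, 1, 1, 1, -1]
        System.out.println(Arrays.toString(cluster(hours, 1.0, 3)));
    }
}
```

The isolated point at 50 hours has no neighbours within the threshold, so it is labelled as an outlier rather than forming its own cluster.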
The relative path of each file is preserved, being added relative to the cluster subdirectory.
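The path handling can be sketched with `java.nio.file.Path`. The helper below is hypothetical, and the behaviour when `preserveSubdirectories` is false (keeping only the filename) is an assumption, since this page only documents the true case.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ClusterPathSketch {

    /** Hypothetical helper: builds a destination path relative to the output directory. */
    static Path relativeDestination(
            String clusterName, Path relativeSource, boolean preserveSubdirectories) {
        Path cluster = Paths.get(clusterName);
        if (preserveSubdirectories) {
            // The entire relative path is appended beneath the cluster subdirectory
            return cluster.resolve(relativeSource);
        }
        // Assumption: otherwise only the filename is kept
        return cluster.resolve(relativeSource.getFileName());
    }

    public static void main(String[] args) {
        Path source = Paths.get("holiday", "beach", "img.jpg");
        System.out.println(relativeDestination("01", source, true));
        System.out.println(relativeDestination("01", source, false));
    }
}
```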
The default-patterns for matching filenames are:
yyyy-mm-dd hh:mm:ss
yyyymmdd_hhmmss
yyyymmdd hhmmss
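One way to match patterns like these is to pair a regular expression that locates the timestamp with a `DateTimeFormatter` that parses it. The sketch below uses only the JDK and does not use `TimestampPattern`, whose API is not shown on this page. The formatter strings (`yyyy-MM-dd HH:mm:ss` and so on) are an assumed translation of the patterns listed above into `java.time` syntax.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilenameTimestampSketch {

    // Regular expressions that locate a candidate timestamp, tried in order
    private static final Pattern[] REGEXES = {
        Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}"),
        Pattern.compile("\\d{8}_\\d{6}"),
        Pattern.compile("\\d{8} \\d{6}")
    };

    // Formatters that parse the text matched by the regex at the same index
    private static final DateTimeFormatter[] FORMATS = {
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"),
        DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss"),
        DateTimeFormatter.ofPattern("yyyyMMdd HHmmss")
    };

    /** Returns the first timestamp found in the filename, if any pattern matches. */
    static Optional<LocalDateTime> extract(String filename) {
        for (int i = 0; i < REGEXES.length; i++) {
            Matcher matcher = REGEXES[i].matcher(filename);
            if (!matcher.find()) {
                continue;
            }
            try {
                return Optional.of(LocalDateTime.parse(matcher.group(), FORMATS[i]));
            } catch (DateTimeParseException e) {
                // Digits matched but do not form a valid date-time; try the next pattern
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("IMG_20230517_143005.jpg"));
        System.out.println(extract("notes.txt"));
    }
}
```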
Constructor and Description |
---|
ClusterByTimestamp() |
Modifier and Type | Method and Description |
---|---|
ClusterMembership | beforeCopying(Path destinationDirectory, List<FileWithDirectoryInput> inputs) To be called once before any calls to CopyFilesNaming.destinationPath(File, DirectoryWithPrefix, int, CopyContext). |
Optional<Path> | destinationPathRelative(File file, DirectoryWithPrefix outputTarget, int index, CopyContext<ClusterMembership> context) Calculates the relative-output path (to be appended to destDir). |
int | getMinimumPerCluster() The minimum number of files that must exist for a cluster. |
double | getThresholdHours() Files whose creation-time differs by <= this parameter are joined into the same cluster. |
List<TimestampPattern> | getTimestampPatterns() The patterns which can be used to extract a date-time from a filename. |
int | getTimeZoneOffset() If >= 0, sets a specific time-offset in hours. |
boolean | isPreserveSubdirectories() If true, the entire relative-path is used when copying files into the cluster directory. |
void | setMinimumPerCluster(int minimumPerCluster) The minimum number of files that must exist for a cluster. |
void | setPreserveSubdirectories(boolean preserveSubdirectories) If true, the entire relative-path is used when copying files into the cluster directory. |
void | setThresholdHours(double thresholdHours) Files whose creation-time differs by <= this parameter are joined into the same cluster. |
void | setTimestampPatterns(List<TimestampPattern> timestampPatterns) The patterns which can be used to extract a date-time from a filename. |
void | setTimeZoneOffset(int timeZoneOffset) If >= 0, sets a specific time-offset in hours. |
Methods inherited from class CopyFilesNaming: destinationPath
checkMisconfigured, describeBean, describeChildren, duplicateBean, fields, findFieldsOfClass, getBeanName, getLocalPath, localise, toString
public ClusterMembership beforeCopying(Path destinationDirectory, List<FileWithDirectoryInput> inputs) throws OperationFailedException
Description copied from class: CopyFilesNaming
To be called once before any calls to CopyFilesNaming.destinationPath(File, DirectoryWithPrefix, int, CopyContext).
beforeCopying in class CopyFilesNaming<ClusterMembership>
Parameters:
destinationDirectory - the directory to which files are copied.
inputs - the total number of files to copy.
Throws:
OperationFailedException
public Optional<Path> destinationPathRelative(File file, DirectoryWithPrefix outputTarget, int index, CopyContext<ClusterMembership> context) throws OutputWriteFailedException
Description copied from class: CopyFilesNaming
Calculates the relative-output path (to be appended to destDir).
destinationPathRelative in class CopyFilesNaming<ClusterMembership>
Parameters:
file - file to be copied
outputTarget - the directory and prefix associated with the file for outputting
index - an increasing sequence of numbers for each file beginning at 0
context - the context for the copying
Throws:
OutputWriteFailedException
public double getThresholdHours()
Files whose creation-time differs by <= this parameter are joined into the same cluster.
This is the principal parameter for affecting the sensitivity of the clustering. It is specified as the number of hours between the date-times of two files.
A larger value encourages a smaller total number of clusters (or larger cluster-size). A smaller value encourages the opposite.
public void setThresholdHours(double thresholdHours)
Files whose creation-time differs by <= this parameter are joined into the same cluster.
This is the principal parameter for affecting the sensitivity of the clustering. It is specified as the number of hours between the date-times of two files.
A larger value encourages a smaller total number of clusters (or larger cluster-size). A smaller value encourages the opposite.
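The effect of the threshold can be seen with a deliberately simplified sketch. It splits sorted timestamps wherever consecutive files are further apart than the threshold, which corresponds to density-based clustering only in the special case where a single file is enough to form a cluster. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ThresholdHoursSketch {

    /** Absolute difference between two date-times, in hours. */
    static double hoursBetween(LocalDateTime a, LocalDateTime b) {
        return Math.abs(Duration.between(a, b).getSeconds()) / 3600.0;
    }

    /** Counts groups formed by splitting wherever the gap exceeds thresholdHours. */
    static int countGroups(List<LocalDateTime> timestamps, double thresholdHours) {
        if (timestamps.isEmpty()) {
            return 0;
        }
        List<LocalDateTime> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        int groups = 1;
        for (int i = 1; i < sorted.size(); i++) {
            if (hoursBetween(sorted.get(i - 1), sorted.get(i)) > thresholdHours) {
                groups++;
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<LocalDateTime> timestamps = Arrays.asList(
                LocalDateTime.of(2023, 5, 17, 9, 0),
                LocalDateTime.of(2023, 5, 17, 9, 30),
                LocalDateTime.of(2023, 5, 17, 13, 0),
                LocalDateTime.of(2023, 5, 18, 9, 0));
        System.out.println(countGroups(timestamps, 1.0));  // prints 3
        System.out.println(countGroups(timestamps, 6.0));  // prints 2
        System.out.println(countGroups(timestamps, 24.0)); // prints 1
    }
}
```

The gaps between consecutive files are 0.5, 3.5 and 20 hours, so raising the threshold merges more of them into the same group.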
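public int getMinimumPerCluster()
The minimum number of files that must exist for a cluster.
public void setMinimumPerCluster(int minimumPerCluster)
The minimum number of files that must exist for a cluster.
public boolean isPreserveSubdirectories()
If true, the entire relative-path is used when copying files into the cluster directory.
public void setPreserveSubdirectories(boolean preserveSubdirectories)
If true, the entire relative-path is used when copying files into the cluster directory.
public List<TimestampPattern> getTimestampPatterns()
The patterns which can be used to extract a date-time from a filename.
public void setTimestampPatterns(List<TimestampPattern> timestampPatterns)
The patterns which can be used to extract a date-time from a filename.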
public int getTimeZoneOffset()
If >= 0, sets a specific time-offset in hours. If == -1, then the offset is taken from the current system time-zone settings.
public void setTimeZoneOffset(int timeZoneOffset)
If >= 0, sets a specific time-offset in hours. If == -1, then the offset is taken from the current system time-zone settings.
Copyright © 2010–2023 Owen Feehan, ETH Zurich, University of Zurich, Hoffmann-La Roche. All rights reserved.