Update 3 UDFs: Percentile, Quantile and Cluster #17375

Open
suyx1999 wants to merge 10 commits into apache:master from suyx1999:fix-udf
suyx1999 commented Mar 27, 2026

Update 3 UDFs: Percentile, Quantile and Cluster

Fix issues in Percentile and Quantile UDFs

Percentile

  • Fix out-of-bounds index issue

Quantile

  • Fix incorrect type conversion

Update Cluster UDF

  • This function takes a single input time series, splits it into non-overlapping contiguous subsequences (windows) of fixed length l, and clusters those subsequences to discover local patterns or segment structure.

Input series

  • One input series only.
  • Types: INT32 / INT64 / FLOAT / DOUBLE.
  • Points are read in time order; trailing samples that do not fill a full window are dropped (only ⌊n/l⌋ windows are used, where n is the number of valid points).
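The windowing rule above can be sketched as follows. This is a minimal illustration, not the PR's code: the class and method names are hypothetical, and the input is assumed to be already filtered down to valid points in time order.

```java
import java.util.Arrays;

public class WindowSplitSketch {
  /**
   * Splits values into floor(n / l) non-overlapping windows of length l.
   * Trailing samples that do not fill a whole window are dropped.
   * (Hypothetical helper; not taken from UDTFCluster.)
   */
  public static double[][] splitIntoWindows(double[] values, int l) {
    if (l <= 0) {
      throw new IllegalArgumentException("l must be a positive integer");
    }
    int windowCount = values.length / l; // integer division = floor(n / l)
    double[][] windows = new double[windowCount][];
    for (int w = 0; w < windowCount; w++) {
      windows[w] = Arrays.copyOfRange(values, w * l, (w + 1) * l);
    }
    return windows;
  }

  public static void main(String[] args) {
    double[] series = {1, 2, 3, 4, 5, 6, 7, 8};
    double[][] windows = splitIntoWindows(series, 3);
    System.out.println(windows.length);              // 2 (samples 7 and 8 are dropped)
    System.out.println(Arrays.toString(windows[1])); // [4.0, 5.0, 6.0]
  }
}
```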

Parameters

| Name | Meaning | Default | Notes |
|---|---|---|---|
| l | Subsequence (window) length | (required) | Positive integer; each window has l consecutive samples. |
| k | Number of clusters | (required) | Integer ≥ 2. |
| method | Clustering algorithm | kmeans | Optional: kmeans, kshape, medoidshape (case-insensitive). Defaults to k-means if omitted. |
| norm | Z-score normalize each subsequence | true | Boolean; if true, each subsequence is standardized before clustering. |
| maxiter | Maximum iterations | 200 | Positive integer. |
| output | Output mode | label | label: one cluster id per window; centroid: concatenate the k centroid vectors in cluster order. |
| sample_rate | Greedy sampling rate | 0.3 | Used only when method = medoidshape; must be in (0, 1]. |

method details

  • kmeans: k-means in Euclidean space (optionally after per-window normalization).
  • kshape: Assign by shape-based distance (SBD from normalized cross-correlation, NCC); centroids updated via SVD on the cluster matrix.
  • medoidshape: Coarse k-means with min(2k, number of windows) clusters, then greedy selection of k representative subsequences; sample_rate controls how many candidates are sampled each round.
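To make the per-window normalization and the shape-based distance concrete, here is a rough sketch. It is written under several assumptions that do not come from the PR: the names are hypothetical, the z-score uses the population standard deviation, a constant window is mapped to all zeros, and the cross-correlation is computed directly in O(l²) rather than with the FFT that ClusterUtils is described as using.

```java
public class ShapeDistanceSketch {
  /** Z-score normalization of one window (population std; constant windows map to zeros). */
  public static double[] zNormalize(double[] x) {
    double mean = 0;
    for (double v : x) {
      mean += v;
    }
    mean /= x.length;
    double var = 0;
    for (double v : x) {
      var += (v - mean) * (v - mean);
    }
    double std = Math.sqrt(var / x.length);
    double[] out = new double[x.length];
    if (std == 0) {
      return out; // assumption: a flat window carries no shape information
    }
    for (int i = 0; i < x.length; i++) {
      out[i] = (x[i] - mean) / std;
    }
    return out;
  }

  /** SBD(x, y) = 1 - max over shifts of NCC(x, y); x and y must have equal length. */
  public static double sbd(double[] x, double[] y) {
    int n = x.length;
    double normX = 0;
    double normY = 0;
    for (int i = 0; i < n; i++) {
      normX += x[i] * x[i];
      normY += y[i] * y[i];
    }
    double denom = Math.sqrt(normX * normY);
    if (denom == 0) {
      return 1.0; // assumption: treat an all-zero window as uncorrelated
    }
    double best = Double.NEGATIVE_INFINITY;
    for (int shift = -(n - 1); shift <= n - 1; shift++) {
      double cc = 0;
      for (int i = 0; i < n; i++) {
        int j = i + shift;
        if (j >= 0 && j < n) {
          cc += x[j] * y[i]; // cross-correlation at this shift
        }
      }
      best = Math.max(best, cc / denom);
    }
    return 1.0 - best;
  }

  public static void main(String[] args) {
    // The same pulse shifted by one sample: Euclidean distance is sqrt(2), SBD is 0.
    System.out.println(sbd(new double[] {0, 1, 0, 0}, new double[] {0, 0, 1, 0})); // 0.0
  }
}
```

The shifted-pulse example shows why kshape and medoidshape can group windows that k-means in Euclidean space would keep apart.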

Output series

Controlled by output:

output = label (default)

  • One output series, type INT32.
  • Number of points = number of full windows, ⌊n/l⌋.
  • Timestamp of each point = time of the first sample in that window; value = cluster id 0 … k−1.

output = centroid

  • One output series, type DOUBLE.
  • Number of points = k × l: for clusters 0 → k−1, emit the l components of each centroid in order (concatenated).
  • Timestamps are 0, 1, 2, … (placeholders only, no physical time meaning).
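A consumer of output = centroid has to undo the concatenation. The sketch below shows one way to read the layout, assuming the placeholder timestamps 0, 1, 2, … can be used directly as a flat index; the class and method names are hypothetical.

```java
public class CentroidLayoutSketch {
  /** Concatenates k centroids of length l into one array of k * l values, in cluster order. */
  public static double[] flatten(double[][] centroids) {
    int k = centroids.length;
    int l = k == 0 ? 0 : centroids[0].length;
    double[] flat = new double[k * l];
    for (int c = 0; c < k; c++) {
      System.arraycopy(centroids[c], 0, flat, c * l, l);
    }
    return flat;
  }

  /** Cluster id that the point at a given placeholder timestamp belongs to. */
  public static int clusterOf(long timestamp, int l) {
    return (int) (timestamp / l);
  }

  /** Position of that point inside its centroid vector. */
  public static int offsetOf(long timestamp, int l) {
    return (int) (timestamp % l);
  }

  public static void main(String[] args) {
    double[][] centroids = {{0.1, 0.2, 0.3}, {1.1, 1.2, 1.3}}; // k = 2, l = 3
    double[] flat = flatten(centroids);
    System.out.println(flat.length);     // 6
    System.out.println(clusterOf(4, 3)); // 1
    System.out.println(offsetOf(4, 3));  // 1
    System.out.println(flat[4]);         // 1.2
  }
}
```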

suyx1999 changed the title from "Fix issues in Percentile and Quantile UDFs" to "Update 3 UDFs: Percentile, Quantile and Cluster" Apr 7, 2026
JackieTien97 requested a review from Copilot April 7, 2026 06:30
Copilot AI left a comment

Pull request overview

This PR updates the library UDF suite by fixing correctness issues in Percentile/Quantile implementations and introducing a new cluster UDTF for subsequence (window-based) clustering in the dlearn module.

Changes:

  • Fixes Percentile-related edge cases (e.g., out-of-bounds handling in GK sketch compression; discrete nearest-rank percentile indexing).
  • Adjusts Quantile UDF value encoding/decoding logic for KLL-based quantile computation.
  • Adds UDTFCluster plus clustering utilities (k-means, k-shape, medoid-shape) and integration tests + registration scripts.
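The value encoding mentioned for the Quantile UDF maps a double to a long whose signed ordering matches the numeric ordering. The sketch below mirrors the expression that appears in the reviewed dataToLong hunk; the decode method and the class name are assumptions added only to illustrate the round trip, and -0.0 and NaN are not special-cased.

```java
public class OrderPreservingEncodingSketch {
  /** Same expression as the DOUBLE branch in the reviewed hunk. */
  public static long encode(double value) {
    long bits = Double.doubleToLongBits(value);
    // Negative doubles have the sign bit set; flipping the lower 63 bits
    // reverses their order so that larger magnitudes become smaller longs.
    return value >= 0d ? bits : bits ^ Long.MAX_VALUE;
  }

  /** Hypothetical inverse: the sign bit survives encoding, so it selects the branch. */
  public static double decode(long encoded) {
    long bits = encoded >= 0L ? encoded : encoded ^ Long.MAX_VALUE;
    return Double.longBitsToDouble(bits);
  }

  public static void main(String[] args) {
    double[] samples = {-3.5, -1.0, 0.0, 0.25, 2.0};
    for (int i = 1; i < samples.length; i++) {
      // Each comparison prints true: encoded order follows numeric order.
      System.out.println(encode(samples[i - 1]) < encode(samples[i]));
    }
    System.out.println(decode(encode(-1.0))); // -1.0
  }
}
```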

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| library-udf/src/main/java/org/apache/iotdb/library/dprofile/util/GKArray.java | Prevents OOB during merge/compress when incoming entries are exhausted. |
| library-udf/src/main/java/org/apache/iotdb/library/dprofile/util/ExactOrderStatistics.java | Fixes percentile indexing via discrete nearest-rank; updates class documentation. |
| library-udf/src/main/java/org/apache/iotdb/library/dprofile/UDAFQuantile.java | Updates numeric-to-long encoding logic used by the quantile sketch and output casting. |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/MedoidShape.java | Adds medoid-shape clustering implementation (coarse k-means + greedy representative selection). |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/KShape.java | Adds k-Shape clustering implementation (SBD assignment + SVD centroid update). |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/KMeans.java | Adds univariate-window k-means implementation for subsequences. |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/ClusterUtils.java | Adds shared utilities (z-normalization, Euclidean distance, FFT-based NCC/SBD). |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/UDTFCluster.java | Introduces cluster UDTF for windowing a single series and clustering windows; supports label/centroid output. |
| library-udf/src/assembly/tools/register-UDF.sh | Registers the new cluster UDF in the Unix registration script. |
| library-udf/src/assembly/tools/register-UDF.bat | Registers the new cluster UDF in the Windows registration script. |
| integration-test/src/test/java/org/apache/iotdb/libudf/it/dlearn/DLearnIT.java | Adds cluster UDF integration tests and a toy series dataset. |

Comment on lines +94 to +106
+  private long dataToLong(double res) {
     switch (dataType) {
       case INT32:
-        return (int) data;
+        return (int) res;
       case FLOAT:
-        result = Float.floatToIntBits((float) data);
-        return (float) data >= 0f ? result : result ^ Long.MAX_VALUE;
+        float f = (float) res;
+        long flBits = Float.floatToIntBits(f);
+        return f >= 0f ? flBits : flBits ^ Long.MAX_VALUE;
       case INT64:
-        return (long) data;
+        return (long) res;
       case DOUBLE:
-        result = Double.doubleToLongBits((double) data);
-        return (double) data >= 0d ? result : result ^ Long.MAX_VALUE;
-      case BLOB:
-      case BOOLEAN:
-      case STRING:
-      case TEXT:
-      case DATE:
-      case TIMESTAMP:
+        long d = Double.doubleToLongBits(res);
+        return res >= 0d ? d : d ^ Long.MAX_VALUE;

Copilot AI Apr 7, 2026

dataToLong(double res) still routes INT64 values through a double and then casts back to long. For INT64 values beyond 2^53, this loses integer precision and can change the quantile result ordering. Consider ingesting INT32/INT64 using row.getInt(0) / row.getLong(0) (or otherwise passing the original integer type into dataToLong) so the sketch sees exact values.
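The precision concern can be reproduced in isolation. The following sketch is independent of the PR's code (the class and method names are hypothetical); it only shows that two distinct long values above 2^53 collapse to the same value once they pass through a double.

```java
public class LongPrecisionSketch {
  /** Round-trips a long through double, as happens when INT64 input is read as a double. */
  public static long viaDouble(long value) {
    return (long) (double) value;
  }

  public static void main(String[] args) {
    long a = (1L << 53) + 1; // 9007199254740993, not representable as a double
    long b = 1L << 53;       // 9007199254740992
    System.out.println(a == b);                       // false
    System.out.println(viaDouble(a) == viaDouble(b)); // true: the two values are now indistinguishable
  }
}
```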

Author

This issue has been fixed.

Comment on lines +237 to 249
+  /** Discrete nearest-rank index into sorted data of length {@code n}; see class Javadoc. */
+  private static int discreteNearestRankIndex(int n, double phi) {
+    int idx = (int) Math.ceil(n * phi) - 1;
+    return Math.max(0, Math.min(n - 1, idx));
+  }
+
   public static float getPercentile(FloatArrayList nums, double phi) {
     if (nums.isEmpty()) {
       throw new NoSuchElementException();
     } else {
       nums.sortThis();
-      return nums.get((int) Math.ceil(nums.size() * phi));
+      return nums.get(discreteNearestRankIndex(nums.size(), phi));
     }

Copilot AI Apr 7, 2026

PR description says unit tests were added for the Percentile and Quantile UDF fixes, but this change set only adds cluster-related IT coverage (and no new tests exercising the updated Percentile/Quantile edge cases). If tests were intended, they appear to be missing from the PR.

Author

Removed the mention of tests from the PR description. Percentile and Quantile are already covered by existing integration tests.

Comment on lines 86 to 90
       case DOUBLE:
         collector.putDouble(0, res);
         break;
-      case TIMESTAMP:
-      case DATE:
-      case TEXT:
-      case STRING:
-      case BLOB:
-      case BOOLEAN:
       default:
         break;

Copilot AI Apr 7, 2026

In terminate(), the rank-to-index conversion appears off by one: findMinValueWithRank((long) (rank * sketch.getN())) will pass k == n when rank == 1, but getApproxRank(v) counts values strictly < v, so the valid target ranks are typically in [0, n-1] (nearest-rank would use ceil(n*rank)-1). As-is, rank=1 can return a value larger than the max input (and other ranks are shifted).
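The off-by-one can be seen by comparing the two conversions side by side. This is a sketch with hypothetical names, not the sketch API from the PR: truncatedRank reproduces the expression quoted in the comment, and nearestRankIndex applies the ceil(n*rank)-1 rule with clamping, the same shape as discreteNearestRankIndex in ExactOrderStatistics.

```java
public class NearestRankSketch {
  /** The conversion quoted in the comment: (long) (rank * n). */
  public static long truncatedRank(long n, double rank) {
    return (long) (rank * n);
  }

  /** Nearest-rank index into n sorted values, clamped to [0, n - 1]. */
  public static long nearestRankIndex(long n, double rank) {
    long idx = (long) Math.ceil(n * rank) - 1;
    return Math.max(0, Math.min(n - 1, idx));
  }

  public static void main(String[] args) {
    long n = 10;
    System.out.println(truncatedRank(n, 1.0));    // 10, one past the last valid index
    System.out.println(nearestRankIndex(n, 1.0)); // 9
    System.out.println(truncatedRank(n, 0.5));    // 5
    System.out.println(nearestRankIndex(n, 0.5)); // 4
    System.out.println(nearestRankIndex(n, 0.0)); // 0 (clamped from -1)
  }
}
```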

Author

This issue has been fixed.

Comment on lines +138 to +142
+      int sampleCount = Math.max(1, (int) (r * n));
+      sampleCount = Math.min(sampleCount, pool.size());
+      Collections.shuffle(pool, rnd);
+      List<Integer> sampleIdx = pool.subList(0, sampleCount);
+

Copilot AI Apr 7, 2026

MedoidShape.fastKShape() uses Collections.shuffle(pool, rnd) with a default new Random() seed, so results can vary between runs (especially when multiple candidates tie on delta, where the shuffled iteration order decides the winner). For a database UDF this can make repeated queries non-reproducible; consider making the default selection deterministic (e.g., fixed seed, no shuffle when sampling all candidates, or deterministic tie-breaking).
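One of the options named above, a fixed seed, could look roughly like this. The sketch reuses the sampling lines from the hunk but is otherwise hypothetical: the class name, the seed parameter, and the use of the pool size in place of n are assumptions, and nothing here is taken from MedoidShape itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SeededSamplingSketch {
  /** Samples candidate indices from 0..poolSize-1; the same seed always yields the same sample. */
  public static List<Integer> sample(int poolSize, double rate, long seed) {
    List<Integer> pool = new ArrayList<>();
    for (int i = 0; i < poolSize; i++) {
      pool.add(i);
    }
    int sampleCount = Math.max(1, (int) (rate * poolSize));
    sampleCount = Math.min(sampleCount, pool.size());
    // A Random built from a fixed seed makes the shuffle order reproducible.
    Collections.shuffle(pool, new Random(seed));
    return new ArrayList<>(pool.subList(0, sampleCount));
  }

  public static void main(String[] args) {
    List<Integer> first = sample(20, 0.3, 42L);
    List<Integer> second = sample(20, 0.3, 42L);
    System.out.println(first.size());        // 6
    System.out.println(first.equals(second)); // true
  }
}
```

Sampling stays randomized with respect to the data, but repeated queries over the same input return the same clusters.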

Author

Clustering techniques inherently involve randomness, which is unavoidable.

suyx1999 and others added 5 commits April 7, 2026 15:12
…/cluster/MedoidShape.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…Cluster.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>