Update 3 UDFs: Percentile, Quantile and Cluster by suyx1999 · Pull Request #17375 · apache/iotdb

suyx1999 · 2026-03-27T07:27:47Z

Update 3 UDFs: Percentile, Quantile and Cluster

Fix issues in Percentile and Quantile UDFs

Percentile

Fix out-of-bounds index issue

Quantile

Fix incorrect type conversion

Update Cluster UDF

This function takes a single input time series, splits it into non-overlapping contiguous subsequences (windows) of fixed length l, and clusters those subsequences to discover local patterns or segment structure.

Input series

One input series only.
Types: INT32 / INT64 / FLOAT / DOUBLE.
Points are read in time order; trailing samples that do not fill a full window are dropped (only ⌊n/l⌋ windows are used, where n is the number of valid points).
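The windowing rule above, together with the per-window z-score step controlled by `norm`, can be sketched as follows. This is a minimal illustration, not the code in `UDTFCluster` or `ClusterUtils`: the class and method names are hypothetical, and the use of population standard deviation and the all-zeros result for a constant window are assumptions.

```java
import java.util.Arrays;

// Hypothetical sketch; names and edge-case handling are assumptions, not the PR's code.
final class WindowSketch {
  /** Splits values into floor(n / l) non-overlapping windows; the trailing remainder is dropped. */
  static double[][] split(double[] values, int l) {
    if (l <= 0) {
      throw new IllegalArgumentException("l must be a positive integer");
    }
    int count = values.length / l; // integer division gives floor(n / l)
    double[][] windows = new double[count][];
    for (int w = 0; w < count; w++) {
      windows[w] = Arrays.copyOfRange(values, w * l, (w + 1) * l);
    }
    return windows;
  }

  /** Z-score normalizes one window in place (population std; constant window -> zeros, by assumption). */
  static void zNormalize(double[] window) {
    double mean = 0;
    for (double v : window) {
      mean += v;
    }
    mean /= window.length;
    double var = 0;
    for (double v : window) {
      var += (v - mean) * (v - mean);
    }
    double std = Math.sqrt(var / window.length);
    for (int i = 0; i < window.length; i++) {
      window[i] = std == 0 ? 0 : (window[i] - mean) / std;
    }
  }
}
```

With 7 points and `l` = 3, this yields 2 windows and the seventh point is ignored.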

Parameters

| Name | Meaning | Default | Notes |
| --- | --- | --- | --- |
| `l` | Subsequence (window) length | (required) | Positive integer; each window has `l` consecutive samples. |
| `k` | Number of clusters | (required) | Integer ≥ 2. |
| `method` | Clustering algorithm | `kmeans` | Optional: `kmeans`, `kshape`, `medoidshape` (case-insensitive). Defaults to k-means if omitted. |
| `norm` | Z-score normalize each subsequence | `true` | Boolean; if `true`, each subsequence is standardized before clustering. |
| `maxiter` | Maximum iterations | `200` | Positive integer. |
| `output` | Output mode | `label` | `label`: one cluster id per window; `centroid`: concatenate the `k` centroid vectors in cluster order. |
| `sample_rate` | Greedy sampling rate | `0.3` | Used only when `method` = `medoidshape`; must be in `(0, 1]`. |

`method` details

kmeans: k-means in Euclidean space (optionally after per-window normalization).
kshape: Assign by shape-based distance (SBD from normalized cross-correlation, NCC); centroids updated via SVD on the cluster matrix.
medoidshape: Coarse k-means with min(2k, number of windows) clusters, then greedy selection of k representative subsequences; sample_rate controls how many candidates are sampled each round.
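To make the shape-based distance used by `kshape` concrete, here is a small sketch that computes SBD as one minus the maximum normalized cross-correlation over all shifts. It uses a direct O(m²) loop for readability, whereas the PR's file summary says `ClusterUtils` uses an FFT-based NCC. The class name and the choice to return 1 for an all-zero window are assumptions.

```java
// Hypothetical sketch of SBD; a naive loop stands in for the FFT-based NCC.
final class SbdSketch {
  /** SBD(x, y) = 1 - max over shifts s of CC_s(x, y) / (||x|| * ||y||). */
  static double sbd(double[] x, double[] y) {
    if (x.length != y.length) {
      throw new IllegalArgumentException("windows must have equal length");
    }
    int m = x.length;
    double nx = 0;
    double ny = 0;
    for (int i = 0; i < m; i++) {
      nx += x[i] * x[i];
      ny += y[i] * y[i];
    }
    double denom = Math.sqrt(nx * ny);
    if (denom == 0) {
      return 1.0; // assumption: an all-zero window has no correlation with anything
    }
    double best = Double.NEGATIVE_INFINITY;
    for (int s = -(m - 1); s <= m - 1; s++) {
      double cc = 0;
      for (int i = 0; i < m; i++) {
        int j = i + s; // x shifted by s against y
        if (j >= 0 && j < m) {
          cc += x[j] * y[i];
        }
      }
      best = Math.max(best, cc / denom);
    }
    return 1.0 - best;
  }
}
```

Two windows that differ only by a shift get a distance near 0, which is why this measure groups subsequences by shape rather than by alignment.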

Output series

Controlled by output:

`output` = `label` (default)

One output series, type INT32.
Number of points = number of full windows, ⌊n/l⌋.
Timestamp of each point = time of the first sample in that window; value = cluster id 0 … k−1.

`output` = `centroid`

One output series, type DOUBLE.
Number of points = k × l: for clusters 0 → k−1, emit the l components of each centroid in order (concatenated).
Timestamps are 0, 1, 2, … (placeholders only, no physical time meaning).
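The two output layouts can be illustrated with a short sketch. It only shows how timestamps and values line up as described above; the class and method names are hypothetical and do not come from `UDTFCluster`.

```java
// Hypothetical sketch of the two output layouts; not the PR's implementation.
final class OutputLayoutSketch {
  /** Label mode: one timestamp per full window, taken from the window's first sample. */
  static long[] labelTimestamps(long[] times, int l) {
    int count = times.length / l; // floor(n / l) full windows
    long[] out = new long[count];
    for (int w = 0; w < count; w++) {
      out[w] = times[w * l];
    }
    return out;
  }

  /** Centroid mode: concatenate k centroids of length l; the array index is the placeholder timestamp. */
  static double[] flattenCentroids(double[][] centroids) {
    int k = centroids.length;
    int l = k == 0 ? 0 : centroids[0].length;
    double[] out = new double[k * l];
    for (int c = 0; c < k; c++) {
      System.arraycopy(centroids[c], 0, out, c * l, l);
    }
    return out;
  }
}
```

In centroid mode, component `j` of cluster `c` therefore sits at placeholder timestamp `c * l + j`.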

Copilot

Pull request overview

This PR updates the library UDF suite by fixing correctness issues in Percentile/Quantile implementations and introducing a new cluster UDTF for subsequence (window-based) clustering in the dlearn module.

Changes:

Fixes Percentile-related edge cases (e.g., out-of-bounds handling in GK sketch compression; discrete nearest-rank percentile indexing).
Adjusts Quantile UDF value encoding/decoding logic for KLL-based quantile computation.
Adds UDTFCluster plus clustering utilities (k-means, k-shape, medoid-shape) and integration tests + registration scripts.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| library-udf/src/main/java/org/apache/iotdb/library/dprofile/util/GKArray.java | Prevents OOB during merge/compress when incoming entries are exhausted. |
| library-udf/src/main/java/org/apache/iotdb/library/dprofile/util/ExactOrderStatistics.java | Fixes percentile indexing via discrete nearest-rank; updates class documentation. |
| library-udf/src/main/java/org/apache/iotdb/library/dprofile/UDAFQuantile.java | Updates numeric-to-long encoding logic used by the quantile sketch and output casting. |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/MedoidShape.java | Adds medoid-shape clustering implementation (coarse k-means + greedy representative selection). |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/KShape.java | Adds k-Shape clustering implementation (SBD assignment + SVD centroid update). |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/KMeans.java | Adds univariate-window k-means implementation for subsequences. |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/ClusterUtils.java | Adds shared utilities (z-normalization, Euclidean distance, FFT-based NCC/SBD). |
| library-udf/src/main/java/org/apache/iotdb/library/dlearn/UDTFCluster.java | Introduces `cluster` UDTF for windowing a single series and clustering windows; supports label/centroid output. |
| library-udf/src/assembly/tools/register-UDF.sh | Registers the new `cluster` UDF in the Unix registration script. |
| library-udf/src/assembly/tools/register-UDF.bat | Registers the new `cluster` UDF in the Windows registration script. |
| integration-test/src/test/java/org/apache/iotdb/libudf/it/dlearn/DLearnIT.java | Adds cluster UDF integration tests and a toy series dataset. |

Copilot · 2026-04-07T06:36:27Z

library-udf/src/main/java/org/apache/iotdb/library/dprofile/UDAFQuantile.java

+  private long dataToLong(double res) {
    switch (dataType) {
      case INT32:
-        return (int) data;
+        return (int) res;
      case FLOAT:
-        result = Float.floatToIntBits((float) data);
-        return (float) data >= 0f ? result : result ^ Long.MAX_VALUE;
+        float f = (float) res;
+        long flBits = Float.floatToIntBits(f);
+        return f >= 0f ? flBits : flBits ^ Long.MAX_VALUE;
      case INT64:
-        return (long) data;
+        return (long) res;
      case DOUBLE:
-        result = Double.doubleToLongBits((double) data);
-        return (double) data >= 0d ? result : result ^ Long.MAX_VALUE;
-      case BLOB:
-      case BOOLEAN:
-      case STRING:
-      case TEXT:
-      case DATE:
-      case TIMESTAMP:
+        long d = Double.doubleToLongBits(res);
+        return res >= 0d ? d : d ^ Long.MAX_VALUE;


dataToLong(double res) still routes INT64 values through a double and then casts back to long. For INT64 values beyond 2^53, this loses integer precision and can change the quantile result ordering. Consider ingesting INT32/INT64 using row.getInt(0) / row.getLong(0) (or otherwise passing the original integer type into dataToLong) so the sketch sees exact values.

This issue has been fixed.
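The precision concern raised in this thread is easy to reproduce in isolation. The sketch below shows two distinct INT64 values collapsing to the same value after a round trip through `double`, and also checks that the sign-flip encoding from the diff keeps doubles in order when compared as longs. The class and method names are hypothetical; only the encoding expression is copied from the hunk above.

```java
// Hypothetical demonstration; only encodeDouble mirrors an expression from the diff.
final class LongPrecisionSketch {
  /** Round-trips a long through double, as happens when INT64 is read as a double. */
  static long viaDouble(long v) {
    return (long) (double) v;
  }

  /** Order-preserving double-to-long encoding as written in the diff's DOUBLE case. */
  static long encodeDouble(double res) {
    long d = Double.doubleToLongBits(res);
    return res >= 0d ? d : d ^ Long.MAX_VALUE;
  }

  public static void main(String[] args) {
    long big = (1L << 53) + 1; // first integer a double cannot represent exactly
    System.out.println(big + " -> " + viaDouble(big));
    System.out.println(encodeDouble(-2.0) < encodeDouble(-1.0));
  }
}
```

Since `2^53 + 1` and `2^53` map to the same double, a sketch fed through `double` cannot distinguish them, which is the ordering risk the reviewer describes.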

library-udf/src/main/java/org/apache/iotdb/library/dlearn/UDTFCluster.java

library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/MedoidShape.java

Copilot · 2026-04-07T06:36:29Z

library-udf/src/main/java/org/apache/iotdb/library/dprofile/util/ExactOrderStatistics.java

+  /** Discrete nearest-rank index into sorted data of length {@code n}; see class Javadoc. */
+  private static int discreteNearestRankIndex(int n, double phi) {
+    int idx = (int) Math.ceil(n * phi) - 1;
+    return Math.max(0, Math.min(n - 1, idx));
+  }
+
  public static float getPercentile(FloatArrayList nums, double phi) {
    if (nums.isEmpty()) {
      throw new NoSuchElementException();
    } else {
      nums.sortThis();
-      return nums.get((int) Math.ceil(nums.size() * phi));
+      return nums.get(discreteNearestRankIndex(nums.size(), phi));
    }


PR description says unit tests were added for the Percentile and Quantile UDF fixes, but this change set only adds cluster-related IT coverage (and no new tests exercising the updated Percentile/Quantile edge cases). If tests were intended, they appear to be missing from the PR.

Removed the mention of tests from the PR description. There are existing integration tests for Percentile and Quantile.
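For context on the indexing fix in the hunk above, this sketch compares the old index expression with the clamped nearest-rank index. The `nearestRankIndex` body is copied from the diff; the class name and the `oldIndex` helper are hypothetical and exist only for comparison.

```java
// Hypothetical comparison harness; nearestRankIndex mirrors the diff's helper.
final class NearestRankSketch {
  /** Index used before the fix: can equal n, which is out of bounds for a list of size n. */
  static int oldIndex(int n, double phi) {
    return (int) Math.ceil(n * phi);
  }

  /** Discrete nearest-rank index, clamped to [0, n - 1]. */
  static int nearestRankIndex(int n, double phi) {
    int idx = (int) Math.ceil(n * phi) - 1;
    return Math.max(0, Math.min(n - 1, idx));
  }

  public static void main(String[] args) {
    int n = 10;
    for (double phi : new double[] {0.0, 0.5, 1.0}) {
      System.out.println(phi + ": old=" + oldIndex(n, phi) + " new=" + nearestRankIndex(n, phi));
    }
  }
}
```

With `n` = 10 and `phi` = 1.0 the old expression yields index 10, one past the last valid index, while the new one yields 9.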

Copilot · 2026-04-07T06:36:29Z

library-udf/src/main/java/org/apache/iotdb/library/dprofile/UDAFQuantile.java

      case DOUBLE:
        collector.putDouble(0, res);
        break;
-      case TIMESTAMP:
-      case DATE:
-      case TEXT:
-      case STRING:
-      case BLOB:
-      case BOOLEAN:
      default:
        break;


In terminate(), the rank-to-index conversion appears off by one: findMinValueWithRank((long) (rank * sketch.getN())) will pass k == n when rank == 1, but getApproxRank(v) counts values strictly < v, so the valid target ranks are typically in [0, n-1] (nearest-rank would use ceil(n*rank)-1). As-is, rank=1 can return a value larger than the max input (and other ranks are shifted).

This issue has been fixed.

Copilot · 2026-04-07T06:36:29Z

library-udf/src/main/java/org/apache/iotdb/library/dlearn/util/cluster/MedoidShape.java

+      int sampleCount = Math.max(1, (int) (r * n));
+      sampleCount = Math.min(sampleCount, pool.size());
+      Collections.shuffle(pool, rnd);
+      List<Integer> sampleIdx = pool.subList(0, sampleCount);
+


MedoidShape.fastKShape() uses Collections.shuffle(pool, rnd) with a default new Random() seed, so results can vary between runs (especially when multiple candidates tie on delta, where the shuffled iteration order decides the winner). For a database UDF this can make repeated queries non-reproducible; consider making the default selection deterministic (e.g., fixed seed, no shuffle when sampling all candidates, or deterministic tie-breaking).

Clustering techniques inherently involve randomness, which is unavoidable.
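For readers weighing the reviewer's fixed-seed suggestion, the sketch below shows how seeding the `Random` passed to `Collections.shuffle` makes the sampled candidate set repeatable. It reuses the `sampleCount` arithmetic from the hunk above, but the class, method, and `seed` parameter are hypothetical and are not part of the PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; the seed parameter is an assumption, not an option of the cluster UDF.
final class SeededSampleSketch {
  /** Samples max(1, floor(r * n)) candidate indices from 0..n-1 using a seeded shuffle. */
  static List<Integer> sample(int n, double r, long seed) {
    List<Integer> pool = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      pool.add(i);
    }
    int sampleCount = Math.max(1, (int) (r * n));
    sampleCount = Math.min(sampleCount, pool.size());
    Collections.shuffle(pool, new Random(seed)); // same seed -> same permutation
    return new ArrayList<>(pool.subList(0, sampleCount));
  }
}
```

Calling `sample` twice with the same arguments returns the same indices, so any tie on delta would be broken the same way on every run.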

suyx1999 added 5 commits March 27, 2026 15:15

Fix two UDFs Percentile and Quantile

978f8ff

Add unit test dependency in pom.xml

c079ad1

Remove Additional Tests

ee55d77

Update UDTFCluster

9b40088

Update DLearnIT and register-UDF

81d9efb

suyx1999 changed the title ~~Fix issues in Percentile and Quantile UDFs~~ Update 3 UDFs: Percentile, Quantile and Cluster Apr 7, 2026

JackieTien97 requested a review from Copilot April 7, 2026 06:30

Copilot started reviewing on behalf of JackieTien97 April 7, 2026 06:31

Copilot AI reviewed Apr 7, 2026

suyx1999 and others added 5 commits April 7, 2026 15:12

Update library-udf/src/main/java/org/apache/iotdb/library/dlearn/util…

5f3b75e

…/cluster/MedoidShape.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update library-udf/src/main/java/org/apache/iotdb/library/dlearn/UDTF…

3a1ca8c

…Cluster.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update format

4584745

Fix Bugs in Quantile

c1ddbe3

Fix errors in register-UDF.bat

1b1d69a

suyx1999 wants to merge 10 commits into apache:master from suyx1999:fix-udf

2 participants
