Introduction
In this blog post, I would like to show how to implement hierarchical data clustering by using T-SQL.
Hierarchical clustering is probably the simplest method of cluster analysis. The are two basic types of this method - divisive (processing begins from the single cluster and on each step clusters are divided) and agglomerative (on the beginning each item resides in its own cluster, on the next steps we merge two closest clusters until all clusters are merged into single one).
The script below utilizes agglomerative approach.
Code
BEGIN TRY
SET NOCOUNT ON
-- 1. Declare & populate data set with sample items for analysis
DECLARE @Items TABLE (
Item INT NOT NULL,
X FLOAT NOT NULL,
Y FLOAT NOT NULL
)
INSERT
@Items(Item, X, Y)
VALUES
(1, 26, 18),
(2, 9, 47),
(3, 25, 15),
(4, 36, 27),
(5, 35, 25),
(6, 10, 43),
(7, 11, 44),
(8, 36, 24),
(9, 26, 14),
(10, 26, 14),
(11, 9, 45),
(12, 33, 23),
(13, 27, 16),
(14, 10, 47)
-- 2. Precalculate distances between items
DECLARE @Distances TABLE (
Item1 INT,
Item2 INT,
Distance FLOAT
)
INSERT
@Distances(Item1, Item2, Distance)
SELECT
I1.Item, I2.Item, SQRT(POWER(I1.X - I2.X, 2) + POWER(I1.Y - I2.Y, 2))
FROM
@Items I1
CROSS JOIN
@Items I2
-- 3. On the first iteration, each item resides in its own cluster
DECLARE
@Iteration INT = 1
DECLARE @Clusters TABLE (
Iteration INT NOT NULL,
Cluster INT NOT NULL,
Item INT NOT NULL
)
INSERT
@Clusters (Iteration, Cluster, Item)
SELECT
@Iteration, Item, Item
FROM
@Items
-- 4. In cycle, merge two closest clusters, until all clusters will be merged into single one
DECLARE
@Cluster1 INT,
@Cluster2 INT
WHILE 1 = 1 BEGIN
-- 4.1. Determine two closest clusters
;WITH CTE AS (
SELECT
Cluster1 = CI1.Cluster,
Item1 = CI1.Item,
Cluster2 = CI2.Cluster,
Item2 = CI2.Item
FROM
@Clusters CI1
CROSS JOIN
@Clusters CI2
WHERE
CI1.Iteration = @Iteration
AND
CI2.Iteration = @Iteration
AND
CI1.Item <> CI2.Item
AND
CI1.Cluster <> CI2.Cluster
)
SELECT TOP 1
@Cluster1 = CTE.Cluster1,
@Cluster2 = CTE.Cluster2
FROM
CTE
INNER JOIN
@Distances CM ON CTE.Item1 = CM.Item1 AND CTE.Item2 = CM.Item2
GROUP BY
CTE.Cluster1, CTE.Cluster2
ORDER BY
AVG(CM.Distance)
-- 4.2. Stop processing if all clusters are merged
IF @@ROWCOUNT = 0
BREAK
SET @Iteration = @Iteration + 1
-- 4.3. Merge two closest clusters, leave all other as is
INSERT
@Clusters (Iteration, Cluster, Item)
SELECT
@Iteration,
CASE WHEN Cluster IN (@Cluster1, @Cluster2) THEN @Cluster1 ELSE Cluster END,
Item
FROM
@Clusters
WHERE
Iteration = @Iteration - 1
END
-- 5. Generate layout for GraphViz to show dendrogram
DECLARE
@GraphVizLayout VARCHAR(MAX)
SET @GraphVizLayout = 'digraph Clusters' + '{' + CHAR(13) + CHAR(10)
-- 5.1. Output graph nodes
SELECT
@GraphVizLayout = @GraphVizLayout + 'N' + CAST(Iteration * 100 + Cluster AS VARCHAR) + '[label = "N' + CAST(Iteration AS VARCHAR) + '.' + CAST(Cluster AS VARCHAR) + '"];' + CHAR(13) + CHAR(10)
FROM
@Clusters
GROUP BY
Iteration, Cluster
-- 5.2. Output graph edges
SELECT
@GraphVizLayout = @GraphVizLayout + 'N' + CAST(CI2.Iteration * 100 + CI2.Cluster AS VARCHAR) + ' -> N' + CAST(CI1.Iteration * 100 + CI1.Cluster AS VARCHAR) + ';' + CHAR(13) + CHAR(10)
FROM
@Clusters CI1
INNER JOIN
@Clusters CI2 ON CI1.Item = CI2.Item AND CI1.Iteration = CI2.Iteration + 1
GROUP BY
CI1.Iteration, CI1.Cluster, CI2.Iteration, CI2.Cluster
SET @GraphVizLayout = @GraphVizLayout + '}'
PRINT @GraphVizLayout
BEGIN TRY
SET NOCOUNT ON
-- 1. Declare & populate data set with sample items for analysis
DECLARE @Items TABLE (
Item INT NOT NULL,
X FLOAT NOT NULL,
Y FLOAT NOT NULL
)
INSERT
@Items(Item, X, Y)
VALUES
(1, 26, 18),
(2, 9, 47),
(3, 25, 15),
(4, 36, 27),
(5, 35, 25),
(6, 10, 43),
(7, 11, 44),
(8, 36, 24),
(9, 26, 14),
(10, 26, 14),
(11, 9, 45),
(12, 33, 23),
(13, 27, 16),
(14, 10, 47)
-- 2. Precalculate distances between items
DECLARE @Distances TABLE (
Item1 INT,
Item2 INT,
Distance FLOAT
)
INSERT
@Distances(Item1, Item2, Distance)
SELECT
I1.Item, I2.Item, SQRT(POWER(I1.X - I2.X, 2) + POWER(I1.Y - I2.Y, 2))
FROM
@Items I1
CROSS JOIN
@Items I2
-- 3. On the first iteration, each item resides in its own cluster
DECLARE
@Iteration INT = 1
DECLARE @Clusters TABLE (
Iteration INT NOT NULL,
Cluster INT NOT NULL,
Item INT NOT NULL
)
INSERT
@Clusters (Iteration, Cluster, Item)
SELECT
@Iteration, Item, Item
FROM
@Items
-- 4. In cycle, merge two closest clusters, until all clusters will be merged into single one
DECLARE
@Cluster1 INT,
@Cluster2 INT
WHILE 1 = 1 BEGIN
-- 4.1. Determine two closest clusters
;WITH CTE AS (
SELECT
Cluster1 = CI1.Cluster,
Item1 = CI1.Item,
Cluster2 = CI2.Cluster,
Item2 = CI2.Item
FROM
@Clusters CI1
CROSS JOIN
@Clusters CI2
WHERE
CI1.Iteration = @Iteration
AND
CI2.Iteration = @Iteration
AND
CI1.Item <> CI2.Item
AND
CI1.Cluster <> CI2.Cluster
)
SELECT TOP 1
@Cluster1 = CTE.Cluster1,
@Cluster2 = CTE.Cluster2
FROM
CTE
INNER JOIN
@Distances CM ON CTE.Item1 = CM.Item1 AND CTE.Item2 = CM.Item2
GROUP BY
CTE.Cluster1, CTE.Cluster2
ORDER BY
AVG(CM.Distance)
-- 4.2. Stop processing if all clusters are merged
IF @@ROWCOUNT = 0
BREAK
SET @Iteration = @Iteration + 1
-- 4.3. Merge two closest clusters, leave all other as is
INSERT
@Clusters (Iteration, Cluster, Item)
SELECT
@Iteration,
CASE WHEN Cluster IN (@Cluster1, @Cluster2) THEN @Cluster1 ELSE Cluster END,
Item
FROM
@Clusters
WHERE
Iteration = @Iteration - 1
END
-- 5. Generate layout for GraphViz to show dendrogram
DECLARE
@GraphVizLayout VARCHAR(MAX)
SET @GraphVizLayout = 'digraph Clusters' + '{' + CHAR(13) + CHAR(10)
-- 5.1. Output graph nodes
SELECT
@GraphVizLayout = @GraphVizLayout + 'N' + CAST(Iteration * 100 + Cluster AS VARCHAR) + '[label = "N' + CAST(Iteration AS VARCHAR) + '.' + CAST(Cluster AS VARCHAR) + '"];' + CHAR(13) + CHAR(10)
FROM
@Clusters
GROUP BY
Iteration, Cluster
-- 5.2. Output graph edges
SELECT
@GraphVizLayout = @GraphVizLayout + 'N' + CAST(CI2.Iteration * 100 + CI2.Cluster AS VARCHAR) + ' -> N' + CAST(CI1.Iteration * 100 + CI1.Cluster AS VARCHAR) + ';' + CHAR(13) + CHAR(10)
FROM
@Clusters CI1
INNER JOIN
@Clusters CI2 ON CI1.Item = CI2.Item AND CI1.Iteration = CI2.Iteration + 1
GROUP BY
CI1.Iteration, CI1.Cluster, CI2.Iteration, CI2.Cluster
SET @GraphVizLayout = @GraphVizLayout + '}'
PRINT @GraphVizLayout
END TRY
BEGIN CATCH
BEGIN CATCH
DECLARE
@ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE(),
@ErrorSeverity INT = ERROR_SEVERITY(),
@ErrorState INT = ERROR_STATE()
RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState)
@ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE(),
@ErrorSeverity INT = ERROR_SEVERITY(),
@ErrorState INT = ERROR_STATE()
RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState)
END CATCH
How It Works
Input
First of all, we need to define data which will be used for analysis. In this script these items are defined directly in T-SQL code, but in the real world they most likely will be obtained from a database table.
Each item has two values which represent X and Y coordinate.
This is how these sample items look like on a scatterplot:
It is easy to notice that items can be grouped into three clusters. Let's see how clustering algorithm will group them.
Calculate distances
On the next step, script calculate distances between each and every item in data set. It is enough to calculate these values one - they will be used on the further steps when calculating distances between clusters.
The distance between two items indicates their similarity. There are a lot of different approaches for their calculation. In this script I am using so-called Euclidean distance.
Clustering
Now, let's see how clustering itself works.
On the first iteration, each item resides in its own cluster.
On the next iterations, the script determines two closest clusters and merges them together, leaving other clusters as is. The distances between clusters are calculated as average distance between items which are included in these clusters. This is so-called UPGMA approach.
The processing is stopped when script cannot identify two clusters to merge.
Output
Usually, results of hierarchical clustering are shown as a dendrogram. Unfortunately, SQL Server Management Studio does not have inbuilt viewer for graphs (like Spatial Viewer for geometry and geographical data). So I utilize third-party tool to draw dendrogram by using output of this script.
I am using GraphViz tool for this. This open-source software tool allows to visualize graphs defined as set of nodes and edges between them.
To generate dendrogram by using GraphViz, you need to store script output text in the file named cluster.gv and then run the following command:
dot.exe -Tpng cluster.gv -o cluster.png
Utility "dot.exe" draws directed graphs as PNG image.
In our case, the dendrogram looks like:
On the dendrogram each node represents separate cluster. Arrows indicate how clusters are built. The first part of node name indicates the iteration number and the second part indicate number of cluster.
As it has been already noted, on the first iteration each item resides in its own cluster and on the last iteration all items reside within sinlge cluster.
Iteration #12 represents situation when input items are grouped into three clusters - this is the best way to group these items from human point of view.
Conclusion
The advantage of hierarchical clustering method is that it is easy to understand and easy to implement.
The drawback of this method is that its output requires additional analysis. Another problem is that its perfromance is not very high, so it cannot be used for analysis of big data sets.
Комментариев нет:
Отправить комментарий