Iurii Marymonskyi's Tech Blog: Hierarchical clustering by using T-SQL

Introduction

In this blog post, I would like to show how to implement hierarchical data clustering by using T-SQL.

Hierarchical clustering is probably the simplest method of cluster analysis. The are two basic types of this method - divisive (processing begins from the single cluster and on each step clusters are divided) and agglomerative (on the beginning each item resides in its own cluster, on the next steps we merge two closest clusters until all clusters are merged into single one).

The script below utilizes agglomerative approach.

Code

BEGIN TRY SET NOCOUNT ON -- 1. Declare & populate data set with sample items for analysis DECLARE @Items TABLE ( Item INT NOT NULL, X FLOAT NOT NULL, Y FLOAT NOT NULL ) INSERT @Items(Item, X, Y) VALUES (1, 26, 18), (2, 9, 47), (3, 25, 15), (4, 36, 27), (5, 35, 25), (6, 10, 43), (7, 11, 44), (8, 36, 24), (9, 26, 14), (10, 26, 14), (11, 9, 45), (12, 33, 23), (13, 27, 16), (14, 10, 47) -- 2. Precalculate distances between items DECLARE @Distances TABLE ( Item1 INT, Item2 INT, Distance FLOAT ) INSERT @Distances(Item1, Item2, Distance) SELECT I1.Item, I2.Item, SQRT(POWER(I1.X - I2.X, 2) + POWER(I1.Y - I2.Y, 2)) FROM @Items I1 CROSS JOIN @Items I2 -- 3. On the first iteration, each item resides in its own cluster DECLARE @Iteration INT = 1 DECLARE @Clusters TABLE ( Iteration INT NOT NULL, Cluster INT NOT NULL, Item INT NOT NULL ) INSERT @Clusters (Iteration, Cluster, Item) SELECT @Iteration, Item, Item FROM @Items -- 4. In cycle, merge two closest clusters, until all clusters will be merged into single one DECLARE @Cluster1 INT, @Cluster2 INT WHILE 1 = 1 BEGIN -- 4.1. Determine two closest clusters ;WITH CTE AS ( SELECT Cluster1 = CI1.Cluster, Item1 = CI1.Item, Cluster2 = CI2.Cluster, Item2 = CI2.Item FROM @Clusters CI1 CROSS JOIN @Clusters CI2 WHERE CI1.Iteration = @Iteration AND CI2.Iteration = @Iteration AND CI1.Item <> CI2.Item AND CI1.Cluster <> CI2.Cluster ) SELECT TOP 1 @Cluster1 = CTE.Cluster1, @Cluster2 = CTE.Cluster2 FROM CTE INNER JOIN @Distances CM ON CTE.Item1 = CM.Item1 AND CTE.Item2 = CM.Item2 GROUP BY CTE.Cluster1, CTE.Cluster2 ORDER BY AVG(CM.Distance) -- 4.2. Stop processing if all clusters are merged IF @@ROWCOUNT = 0 BREAK SET @Iteration = @Iteration + 1 -- 4.3. Merge two closest clusters, leave all other as is INSERT @Clusters (Iteration, Cluster, Item) SELECT @Iteration, CASE WHEN Cluster IN (@Cluster1, @Cluster2) THEN @Cluster1 ELSE Cluster END, Item FROM @Clusters WHERE Iteration = @Iteration - 1 END -- 5. Generate layout for GraphViz to show dendrogram DECLARE @GraphVizLayout VARCHAR(MAX) SET @GraphVizLayout = 'digraph Clusters' + '{' + CHAR(13) + CHAR(10) -- 5.1. Output graph nodes SELECT @GraphVizLayout = @GraphVizLayout + 'N' + CAST(Iteration * 100 + Cluster AS VARCHAR) + '[label = "N' + CAST(Iteration AS VARCHAR) + '.' + CAST(Cluster AS VARCHAR) + '"];' + CHAR(13) + CHAR(10) FROM @Clusters GROUP BY Iteration, Cluster -- 5.2. Output graph edges SELECT @GraphVizLayout = @GraphVizLayout + 'N' + CAST(CI2.Iteration * 100 + CI2.Cluster AS VARCHAR) + ' -> N' + CAST(CI1.Iteration * 100 + CI1.Cluster AS VARCHAR) + ';' + CHAR(13) + CHAR(10) FROM @Clusters CI1 INNER JOIN @Clusters CI2 ON CI1.Item = CI2.Item AND CI1.Iteration = CI2.Iteration + 1 GROUP BY CI1.Iteration, CI1.Cluster, CI2.Iteration, CI2.Cluster SET @GraphVizLayout = @GraphVizLayout + '}' PRINT @GraphVizLayout

`END TRY`

`BEGIN CATCH`

`DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE(), @ErrorSeverity INT = ERROR_SEVERITY(), @ErrorState INT = ERROR_STATE() RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState)`

`END CATCH`

How It Works

Input

First of all, we need to define data which will be used for analysis. In this script these items are defined directly in T-SQL code, but in the real world they most likely will be obtained from a database table.

Each item has two values which represent X and Y coordinate.

This is how these sample items look like on a scatterplot:

It is easy to notice that items can be grouped into three clusters. Let's see how clustering algorithm will group them.

Calculate distances

On the next step, script calculate distances between each and every item in data set. It is enough to calculate these values one - they will be used on the further steps when calculating distances between clusters.

The distance between two items indicates their similarity. There are a lot of different approaches for their calculation. In this script I am using so-called Euclidean distance.

Clustering

Now, let's see how clustering itself works.

On the first iteration, each item resides in its own cluster.

On the next iterations, the script determines two closest clusters and merges them together, leaving other clusters as is. The distances between clusters are calculated as average distance between items which are included in these clusters. This is so-called UPGMA approach.

The processing is stopped when script cannot identify two clusters to merge.

Output

Usually, results of hierarchical clustering are shown as a dendrogram. Unfortunately, SQL Server Management Studio does not have inbuilt viewer for graphs (like Spatial Viewer for geometry and geographical data). So I utilize third-party tool to draw dendrogram by using output of this script.

I am using GraphViz tool for this. This open-source software tool allows to visualize graphs defined as set of nodes and edges between them.

To generate dendrogram by using GraphViz, you need to store script output text in the file named cluster.gv and then run the following command:

dot.exe -Tpng cluster.gv -o cluster.png

Utility "dot.exe" draws directed graphs as PNG image.

In our case, the dendrogram looks like:

On the dendrogram each node represents separate cluster. Arrows indicate how clusters are built. The first part of node name indicates the iteration number and the second part indicate number of cluster.

As it has been already noted, on the first iteration each item resides in its own cluster and on the last iteration all items reside within sinlge cluster.

Iteration #12 represents situation when input items are grouped into three clusters - this is the best way to group these items from human point of view.

Conclusion

The advantage of hierarchical clustering method is that it is easy to understand and easy to implement.

The drawback of this method is that its output requires additional analysis. Another problem is that its perfromance is not very high, so it cannot be used for analysis of big data sets.

Iurii Marymonskyi's Tech Blog

воскресенье, 17 февраля 2013 г.

Hierarchical clustering by using T-SQL

Introduction

Code

`END TRY`

`BEGIN CATCH`

`DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE(), @ErrorSeverity INT = ERROR_SEVERITY(), @ErrorState INT = ERROR_STATE() RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState)`

`END CATCH`

How It Works

Input

Calculate distances

Clustering

Output

Conclusion

Комментариев нет:

Отправить комментарий

воскресенье, 17 февраля 2013 г.

Hierarchical clustering by using T-SQL

Introduction

Code

END TRY

BEGIN CATCH

DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE(), @ErrorSeverity INT = ERROR_SEVERITY(), @ErrorState INT = ERROR_STATE() RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState)

END CATCH

How It Works

Input

Calculate distances

Clustering

Output

Conclusion

Комментариев нет:

Отправить комментарий

воскресенье, 17 февраля 2013 г.

`END TRY`

`BEGIN CATCH`

`DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE(), @ErrorSeverity INT = ERROR_SEVERITY(), @ErrorState INT = ERROR_STATE() RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState)`

`END CATCH`