воскресенье, 17 февраля 2013 г.

Hierarchical clustering by using T-SQL

Introduction

 
 
In this blog post, I would like to show how to implement hierarchical data clustering by using T-SQL.
 
Hierarchical clustering is probably the simplest method of cluster analysis. The are two basic types of this method - divisive (processing begins from the single cluster and on each step clusters are divided) and agglomerative (on the beginning each item resides in its own cluster, on the next steps we merge two closest clusters until all clusters are merged into single one).
 
The script below utilizes agglomerative approach.

Code

 


BEGIN TRY
  
SET NOCOUNT ON

  
-- 1. Declare & populate data set with sample items for analysis
  
DECLARE @Items TABLE (
      
Item INT NOT NULL,
      
X FLOAT NOT NULL,
      
Y FLOAT NOT NULL
   )

  
INSERT
      
@Items(Item, X, Y)
  
VALUES
      
(1, 26, 18),
       (
2, 9, 47),
       (
3, 25, 15),
       (
4, 36, 27),
       (
5, 35, 25),
       (
6, 10, 43),
       (
7, 11, 44),
       (
8, 36, 24),
       (
9, 26, 14),
       (
10, 26, 14),
       (
11, 9, 45),
       (
12, 33, 23),
       (
13, 27, 16),
       (
14, 10, 47)
      
  
-- 2. Precalculate distances between items
  
DECLARE @Distances TABLE (
      
Item1 INT,
      
Item2 INT,
      
Distance FLOAT
  
)

  
INSERT
      
@Distances(Item1, Item2, Distance)
  
SELECT
      
I1.Item, I2.Item, SQRT(POWER(I1.X - I2.X, 2) + POWER(I1.Y - I2.Y, 2))
  
FROM
      
@Items I1
  
CROSS JOIN
      
@Items I2

  
-- 3. On the first iteration, each item resides in its own cluster
  
DECLARE
      
@Iteration INT = 1

  
DECLARE @Clusters TABLE (
      
Iteration INT NOT NULL,
      
Cluster INT NOT NULL,
      
Item INT NOT NULL
   )

  
INSERT
      
@Clusters (Iteration, Cluster, Item)
  
SELECT
      
@Iteration, Item, Item
  
FROM
      
@Items

  
-- 4. In cycle, merge two closest clusters, until all clusters will be merged into single one
  
DECLARE
      
@Cluster1 INT,
      
@Cluster2 INT

   WHILE
1 = 1 BEGIN
      
-- 4.1. Determine two closest clusters
      
;WITH CTE AS (
          
SELECT
              
Cluster1 = CI1.Cluster,
              
Item1 = CI1.Item,
              
Cluster2 = CI2.Cluster,
              
Item2 = CI2.Item
          
FROM
              
@Clusters CI1
          
CROSS JOIN
              
@Clusters CI2
          
WHERE
              
CI1.Iteration = @Iteration
                  
AND
              
CI2.Iteration = @Iteration
                  
AND
              
CI1.Item <> CI2.Item
                  
AND
              
CI1.Cluster <> CI2.Cluster
      
)
      
SELECT TOP 1
          
@Cluster1 = CTE.Cluster1,
          
@Cluster2 = CTE.Cluster2
      
FROM
          
CTE
      
INNER JOIN
          
@Distances CM ON CTE.Item1 = CM.Item1 AND CTE.Item2 = CM.Item2
      
GROUP BY
          
CTE.Cluster1, CTE.Cluster2
      
ORDER BY
          
AVG(CM.Distance)

      
-- 4.2. Stop processing if all clusters are merged
      
IF @@ROWCOUNT = 0
          
BREAK

       SET
@Iteration = @Iteration + 1

      
-- 4.3. Merge two closest clusters, leave all other as is
      
INSERT
          
@Clusters (Iteration, Cluster, Item)
      
SELECT
          
@Iteration,
          
CASE WHEN Cluster IN (@Cluster1, @Cluster2) THEN @Cluster1 ELSE Cluster END,
          
Item
      
FROM
          
@Clusters
      
WHERE
          
Iteration = @Iteration - 1
  
END

  
-- 5. Generate layout for GraphViz to show dendrogram
  
DECLARE
      
@GraphVizLayout VARCHAR(MAX)

  
SET @GraphVizLayout = 'digraph Clusters' + '{' + CHAR(13) + CHAR(10)

  
-- 5.1. Output graph nodes
  
SELECT
      
@GraphVizLayout = @GraphVizLayout + 'N' + CAST(Iteration * 100 + Cluster AS VARCHAR) + '[label = "N' + CAST(Iteration AS VARCHAR) + '.' + CAST(Cluster AS VARCHAR) + '"];' + CHAR(13) + CHAR(10)
  
FROM
      
@Clusters
  
GROUP BY
      
Iteration, Cluster

  
-- 5.2. Output graph edges
  
SELECT
      
@GraphVizLayout = @GraphVizLayout + 'N' + CAST(CI2.Iteration * 100 + CI2.Cluster AS VARCHAR) + ' -> N' + CAST(CI1.Iteration * 100 + CI1.Cluster AS VARCHAR) + ';' + CHAR(13) + CHAR(10)
  
FROM
      
@Clusters CI1
  
INNER JOIN
      
@Clusters CI2 ON CI1.Item = CI2.Item AND CI1.Iteration = CI2.Iteration + 1
  
GROUP BY
      
CI1.Iteration, CI1.Cluster, CI2.Iteration, CI2.Cluster

  
SET @GraphVizLayout = @GraphVizLayout + '}'

  
PRINT @GraphVizLayout

END TRY


BEGIN CATCH


   DECLARE
      
@ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE(),
      
@ErrorSeverity INT = ERROR_SEVERITY(),
      
@ErrorState INT = ERROR_STATE()

  
RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState)

END CATCH

 

 

How It Works

 

Input

 
First of all, we need to define data which will be used for analysis. In this script these items are defined directly in T-SQL code, but in the real world they most likely will be obtained from a database table.
 
Each item has two values which represent X and Y coordinate.
 
This is how these sample items look like on a scatterplot:


It is easy to notice that items can be grouped into three clusters. Let's see how clustering algorithm will group them.

 

Calculate distances

 
On the next step, script calculate distances between each and every item in data set. It is enough to calculate these values one - they will be used on the further steps when calculating distances between clusters.
 
The distance between two items indicates their similarity. There are a lot of different approaches for their calculation. In this script I am using so-called Euclidean distance.
 

 

Clustering

 
Now, let's see how clustering itself works.
 
On the first iteration, each item resides in its own cluster.
 
On the next iterations, the script determines two closest clusters and merges them together, leaving other clusters as is. The distances between clusters are calculated as average distance between items which are included in these clusters. This is so-called UPGMA approach.
 
The processing is stopped when script cannot identify two clusters to merge.

 

Output

 
Usually, results of hierarchical clustering are shown as a dendrogram. Unfortunately, SQL Server Management Studio does not have inbuilt viewer for graphs (like Spatial Viewer for geometry and geographical data). So I utilize third-party tool to draw dendrogram by using output of this script.
 
I am using GraphViz tool for this. This open-source software tool allows to visualize graphs defined as set of nodes and edges between them.
 
To generate dendrogram by using GraphViz, you need to store script output text in the file named cluster.gv and then run the following command:
 


dot.exe -Tpng cluster.gv -o cluster.png

 
Utility "dot.exe" draws directed graphs as PNG image.
 
In our case, the dendrogram looks like:
 
 
On the dendrogram each node represents separate cluster. Arrows indicate how clusters are built. The first part of node name indicates the iteration number and the second part indicate number of cluster.
 
As it has been already noted, on the first iteration each item resides in its own cluster and on the last iteration all items reside within sinlge cluster.
 
Iteration #12 represents situation when input items are grouped into three clusters - this is the best way to group these items from human point of view.

 

Conclusion

 
 
The advantage of hierarchical clustering method is that it is easy to understand and easy to implement.
 
The drawback of this method is that its output requires additional analysis. Another problem is that its perfromance is not very high, so it cannot be used for analysis of big data sets.

Комментариев нет:

Отправить комментарий