Moving beyond Table 1: Clustering methods (Discussion Group March 31)

Note that the images are taken from Sophie’s presentation and should not be reproduced.

Table 1 is the cornerstone of lots of epidemiology papers, but Sophie provided us with an alternative descriptive analysis option that is often overlooked. Clustering methods allow us to visualize where in space or time observations tend to, well, cluster, without any need to stratify. She introduced the k-means and hierarchical agglomerative clustering approaches, before demonstrating an application in her former master’s thesis and sparking a discussion on these methods’ uses.

[Image: k-means clustering illustration]

The k-means approach first requires pre-specifying the number of clusters you are after. The centre (or mean) of each group is then determined, and the observations assigned to a given group are those ‘closest’ to this mean. ‘Closest’ can be measured in several ways, for example Euclidean distance (the straight-line distance between points) or Fréchet distance (a distance between curves). The procedure then iterates, much like expectation maximization: points are reassigned to their nearest centre and the centres are recalculated, repeating until the groups are stable.
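The iteration described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming Euclidean distance and a random choice of starting centres; the function name `kmeans`, its parameters, and the toy points are made up for this example and are not from Sophie's presentation.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of coordinate tuples; returns (centres, labels)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k distinct observations
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre (Euclidean).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster ends up empty
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, labels


# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, labels = kmeans(points, k=2)
```

With these toy points the first three observations end up in one cluster and the last three in the other. On real data the result can depend on the starting centres, so it is common to rerun with several seeds and compare.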

The hierarchical agglomerative clustering approach does not require pre-specifying the number of clusters, as each observation is initially treated as its own cluster. A tree is then created, joining the ‘most similar’ clusters at each branch. We keep doing so until there is only one cluster. We then choose to ‘cut’ the tree at the most sensible point; if we cut it near the start, where few merges have happened, we end up with more clusters, and vice versa. As with k-means, ‘similarity’ can be defined in many ways; for example, at each branch the two clusters with the smallest average distance between their observations may be combined.
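A small Python sketch of the agglomerative idea, assuming Euclidean distance and the 'smallest average distance' rule mentioned above (often called average linkage). The function names and toy points are hypothetical. As a simplification, instead of building the whole tree and cutting it afterwards, the sketch simply stops merging once the requested number of clusters remains.

```python
import math
from itertools import combinations


def average_linkage(a, b, points):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain; returns index lists."""
    clusters = [[i] for i in range(len(points))]  # every observation starts alone
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], points
            ),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]  # a < b, so position a is unaffected by the deletion
    return [sorted(c) for c in clusters]


# Hypothetical toy data: two tight pairs and one distant observation.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
three = agglomerate(points, 3)  # [[0, 1], [2, 3], [4]]
two = agglomerate(points, 2)    # [[0, 1, 2, 3], [4]]
```

Asking for three clusters corresponds to cutting the tree earlier, and asking for two corresponds to cutting it later, after the two tight pairs have been joined.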

Sophie provided an example based on 10 years of smoking rates in Glasgow districts. She clustered based on minimizing the distance between the shapes of the curves, and thus could visualize the groups of districts with different time trends in smoking rates. She found there was one group with a steady decline, another with a decline followed by a spike (which coincided with the economic crisis), and finally one that was stable over time. Interestingly, these overlapped with socio-economic status.
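To give a feel for what a 'distance between curves' might look like, here is a sketch of the discrete Fréchet distance between two curves given as lists of (year, rate) points. This summary does not record exactly which distance or software Sophie used, so treat this purely as an illustration; the function name and the numbers are invented and are not the Glasgow data.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two curves given as lists of points."""
    n, m = len(p), len(q)
    # ca[i][j] holds the distance for the sub-curves p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance along p, along q, or along both, whichever is cheapest.
                best_previous = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_previous, d)
    return ca[-1][-1]


# Invented (year, rate) curves for three hypothetical districts.
declining_a = [(0, 30), (1, 28), (2, 26), (3, 24)]
declining_b = [(0, 31), (1, 29), (2, 27), (3, 25)]
stable = [(0, 30), (1, 30), (2, 30), (3, 30)]

close = discrete_frechet(declining_a, declining_b)  # two similar declines
far = discrete_frechet(declining_a, stable)         # decline vs flat trend
```

The two declining curves come out much closer to each other than either is to the stable one, which is the kind of information a clustering method then uses to group districts by trend. Note that mixing years and rates in one distance means the result depends on the units chosen for each axis.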

Since these methods contrast with the causal, parametric focus of most of our classes, the discussion centred on their use. Key to these clustering approaches is that there are no distributional assumptions, no means of calculating error, and no ‘significance’ testing. They are exploratory and hypothesis-generating (but can still be a paper on their own!). Perhaps they fit into the surveillance side of epidemiology, which we know little about; if we repeated clustering over time we could monitor changes in disease rates, for example. Regardless, this presentation opened our minds to a method few had seen, and prompted us to consider new ways to describe data that move beyond the typical Table 1.

The full presentation is available here.