Outliers in cluster analysis

Olly Rice

Join Date: Apr 2018

Posts: 11
#1

02 Apr 2018, 05:35

Hi there!

I'm carrying out cluster analysis (Ward's linkage, with the Caliński-Harabasz stopping rule) on a basket of bank balance sheet variables to identify different types of bank behaviour in Europe and how they change over time. However, when I run the clustering for a given year, it might give me 3 clusters as the preferred solution (according to the pseudo-F), but one of the clusters has just 1 bank in it, whilst the other 2 have hundreds. I can exclude that bank, but the next time I run the clustering there is again a cluster with just 1 or 2 banks.

Does anyone here know whether these cases are likely to be outliers that I need to deal with before running the cluster commands (and if so, what outlier identification methods would you recommend?), or is this always likely to occur when running cluster analysis?

General details: the sample is approx. 3000 banks for the years 2010-2016 inclusive, clustered on a basket of 8 balance sheet variables. The data are in long format, with no missing data.
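
One screening approach of the kind asked about here is a robust z-score per variable, which uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation, so that extreme banks do not distort their own yardstick. The following is a minimal sketch in Python rather than Stata, and it is not part of Stata's cluster commands. The data, the function names and the 3.5 cut-off are illustrative assumptions, not values from this sample.

```python
import statistics


def robust_z_scores(values):
    """Robust z-scores: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(rows, threshold=3.5):
    """Indices of rows where any variable has |robust z| above threshold.

    The threshold of 3.5 is an assumed rule of thumb; tune it to the data.
    """
    flagged = set()
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        for i, z in enumerate(robust_z_scores(column)):
            if abs(z) > threshold:
                flagged.add(i)
    return sorted(flagged)


# Made-up data: 6 banks x 2 balance sheet ratios (hypothetical values).
banks = [
    [0.10, 0.52],
    [0.12, 0.48],
    [0.11, 0.50],
    [0.09, 0.55],
    [0.13, 0.47],
    [0.95, 0.51],  # extreme value on the first ratio
]

outlier_rows = flag_outliers(banks)
print(outlier_rows)
```

With panel data it probably makes sense to screen each year separately, since the clustering itself is run year by year. Whether a flagged bank should be dropped, winsorised or kept as a genuine special case is a substantive judgment that the screen cannot make.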

Many thanks for any insight you can offer, it's much appreciated,

Olly
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

02 Apr 2018, 06:25

Have any of the variables been standardized or otherwise transformed? If so, that may be the cause of this pattern.

Best regards,

Marcos
Olly Rice

Join Date: Apr 2018

Posts: 11
#3

02 Apr 2018, 10:38

Hi Marcos,

All my variables have been standardised to reduce the bias from different scales between variables. This seemed to make a big difference in terms of actually reaching a pseudo-F score I could use (before, the suggested number of clusters just went on beyond 15), though with or without standardisation I have had clusters appear with just one bank making up the whole cluster.

Intrigued to know how standardisation might affect my results beyond reducing unit bias though!
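
One reason single-bank clusters can survive standardisation is that z-scoring rescales each variable but does not pull an extreme observation any closer to the rest, so Ward's linkage may still find it cheapest to leave that bank alone until the very end. A toy sketch of this in Python (not Stata), using a naive implementation of Ward's merge criterion. The data and helper names are made up for illustration and are not taken from the sample discussed here.

```python
import statistics


def standardise(rows):
    """Z-score each column: (x - mean) / sample standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def ward_clusters(points, k):
    """Naive Ward's-linkage agglomeration, stopped at k clusters.

    Returns a list of clusters, each a list of row indices. Fine for a
    toy example; far too slow for thousands of banks.
    """
    n_vars = len(points[0])
    clusters = [[i] for i in range(len(points))]

    def centroid(members):
        return [sum(points[i][j] for i in members) / len(members)
                for j in range(n_vars)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * d2

    while len(clusters) > k:
        _, p, q = min(
            (merge_cost(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Made-up data: two groups of four banks plus one extreme bank (row 8).
banks = [
    [0.10, 0.50], [0.12, 0.52], [0.11, 0.48], [0.09, 0.51],
    [0.30, 0.20], [0.32, 0.22], [0.31, 0.18], [0.29, 0.21],
    [0.90, 0.95],
]

clusters = ward_clusters(standardise(banks), k=3)
print([len(c) for c in clusters])
```

In this toy case the extreme bank ends up as its own cluster even though every variable was standardised first, which mirrors the pattern described above. It does not show that the single-bank clusters in the real data are outliers, only that standardisation by itself would not be expected to remove them.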

Many thanks

Olly
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#4

02 Apr 2018, 14:05

The Stata manual has this to say:

Stata’s cluster command has no built-in data transformations, but because Stata has full data management and statistical capabilities, you can use other Stata commands to transform your data before calling the cluster command. Standardizing the variables is sometimes important to keep a variable with high variability from dominating the cluster analysis. In other cases, standardizing variables hides the true groupings present in the data. The decision to standardize or perform other data transformations depends on the type of data and the nature of the groups.

Data transformations (such as standardization of variables) and the variables selected for use in clustering can also greatly affect the groupings that are discovered.
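
To make the point about a high-variability variable dominating more concrete, here is a small sketch in Python (not Stata) that splits the squared Euclidean distance between two banks into the share contributed by each variable, before and after z-scoring. The two variables (total assets in millions and an equity ratio), the values and the function names are assumptions made up for illustration.

```python
import statistics


def standardise(rows):
    """Z-score each column: (x - mean) / sample standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def squared_distance_shares(a, b):
    """Share of the squared Euclidean distance between a and b
    that is contributed by each variable."""
    parts = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(parts)
    return [p / total for p in parts]


# Made-up data: [total assets in millions, equity ratio] for four banks.
banks = [
    [1000.0, 0.05],
    [1010.0, 0.30],
    [3000.0, 0.05],
    [2990.0, 0.31],
]

# Banks 0 and 3 differ a lot on both variables.
raw_shares = squared_distance_shares(banks[0], banks[3])

z = standardise(banks)
std_shares = squared_distance_shares(z[0], z[3])

print("raw:", raw_shares)
print("standardised:", std_shares)
```

On the raw scale virtually all of the distance comes from total assets, simply because of its units, whereas after standardising the two variables contribute in roughly comparable amounts. Whether that equal weighting is what you want is exactly the judgment call the manual describes.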

Best regards,

Marcos