  • Outliers in cluster analysis

    Hi there!

    I'm carrying out a cluster analysis (Ward's linkage, using the Calinski stopping rule) on a basket of bank balance sheet variables to identify different types of bank behaviour in Europe and how they change over time. However, when I run the clustering for a given year, it might give me 3 clusters as the preferred solution (according to the pseudo-F), but one of the clusters has just 1 bank in it, whilst the other 2 have hundreds. I can exclude that bank, but then the next time I run the clustering there is again a cluster with just 1 or 2 banks compared to the others.

    Does anyone here know whether these cases are likely to be genuine outliers that I need to deal with before running the cluster commands (and, if so, which outlier identification methods might be recommended), or is this always likely to occur when running cluster analysis?

    General details: the sample is approx. 3000 banks for the years 2010-2016 inclusive, clustered on a basket of 8 balance sheet variables. The data are in long format, with no missing data.

    Many thanks for any insight you can offer, it's much appreciated,

    Olly
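On the question of outlier identification methods, one simple and widely used screen is the modified z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD) so that the extreme values do not distort the yardstick themselves. The following is only a minimal sketch in Python rather than Stata; the function names, the made-up bank data and the 3.5 cut-off (a conventional rule of thumb) are all illustrative assumptions, not anything taken from the thread.

```python
from statistics import median


def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each value (0.0 if MAD is zero)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: the score is undefined, so flag nothing.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(rows, threshold=3.5):
    """Indices of rows whose |modified z| exceeds threshold on any variable."""
    columns = list(zip(*rows))
    scores = [modified_z_scores(col) for col in columns]
    return [
        i for i in range(len(rows))
        if any(abs(col_scores[i]) > threshold for col_scores in scores)
    ]


# Hypothetical data: two balance sheet ratios per bank, for a single year.
# The last bank is extreme on the second ratio.
banks = [(0.10, 0.50), (0.12, 0.55), (0.11, 0.52), (0.09, 0.48), (0.10, 9.00)]
print(flag_outliers(banks))  # [4]
```

With panel data like this, the screen would presumably be applied year by year, and a flagged bank is only a candidate for inspection: whether it is a data error or a genuinely distinct type of bank is a substantive judgement.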

  • #2
    I wonder whether the variables were standardized or transformed. If so, this may be the cause of such a pattern.
    Best regards,

    Marcos

    • #3
      Hi Marcos,

      All my variables have been standardised to reduce the bias of different scales between variables. This seemed to make a big difference in terms of actually reaching a pseudo-F score I could use (before, the scores just kept going beyond 15 clusters), though with or without standardisation I have had clusters appear with just one bank making up the whole cluster.

      Intrigued to know how standardisation might affect my results beyond reducing unit bias though!

      Many thanks

      Olly

      • #4
        We find this information in the Stata manual:

        Stata’s cluster command has no built-in data transformations, but because Stata has full data
        management and statistical capabilities, you can use other Stata commands to transform your data
        before calling the cluster command. Standardizing the variables is sometimes important to keep
        a variable with high variability from dominating the cluster analysis. In other cases, standardizing
        variables hides the true groupings present in the data. The decision to standardize or perform other
        data transformations depends on the type of data and the nature of the groups.
        Data transformations (such as standardization of variables) and the variables selected for use in
        clustering can also greatly affect the groupings that are discovered.
        Best regards,

        Marcos
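To illustrate the manual's point about a high-variability variable dominating the analysis: Ward's linkage works from (squared) Euclidean distances, so one way to see the effect is to ask how much of the total pairwise squared distance each variable contributes before and after z-scoring. The Python sketch below (not Stata) uses made-up figures, and the function names are hypothetical, chosen only for this illustration.

```python
from itertools import combinations
from statistics import mean, stdev


def standardise(rows):
    """Z-score each variable (column) using its sample mean and sample SD."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, mus, sds))
        for row in rows
    ]


def distance_shares(rows):
    """Share of the total pairwise squared Euclidean distance due to each variable."""
    n_vars = len(rows[0])
    totals = [0.0] * n_vars
    for a, b in combinations(rows, 2):
        for j in range(n_vars):
            totals[j] += (a[j] - b[j]) ** 2
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


# Hypothetical banks: (total assets in millions, an equity ratio).
banks = [(1200.0, 0.08), (45000.0, 0.12), (300.0, 0.10), (98000.0, 0.09)]

raw = distance_shares(banks)
scaled = distance_shares(standardise(banks))
print([round(s, 3) for s in raw])     # [1.0, 0.0]
print([round(s, 3) for s in scaled])  # [0.5, 0.5]
```

On the raw scale the assets variable accounts for essentially all of the distance, so the ratio has no say in the groupings. After standardisation every variable carries equal weight, which is the sense in which it can also hide groupings that really are driven by one variable.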
