  • Probit model, collinearity from factor analysis data

    Hello everybody,

    I used the results of a PCA as the basis for an EFA and now want to use my findings (6 factors) in a probit model.

    What I did so far and what is planned:
    1. I split my dataset into 4 sets.
    2. I used PCA to trim down my ~180 variables (I now have 8 components describing the items).
    3. I did not use the 8 components themselves; instead, I took the trimmed-down items behind the components (around 50) and ran an EFA on set 2.
    4. After finding the underlying structure of 6 factors, I want to apply these findings to a third set in order to set up a probit model.
    5. The last of the 4 sets is for running the probit model.
    Now:
    I am somewhat stuck on how to implement the probit model.
    I used "predict fa1 fa2 ... fa11" to obtain new variables for the factors I found via the EFA. Then, through "mkmat ..., mat(probitraw) obs nchar(1)", "mat probitfa = probitraw*fa", and "svmat probitfa, names(col)", I transferred the structure / factors onto my new set 3.
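
    Concretely, the manual step looked roughly like this (item names are placeholders; fa is the loading matrix I saved from the EFA on set 2):

    * save the rotated loadings after -factor- and -rotate- on set 2
    matrix fa = e(r_L)
    * build a data matrix from the same items in set 3
    mkmat item1 item2 item3, mat(probitraw) obs nchar(1)
    * multiply the observations by the loadings to get raw factor values for set 3
    matrix probitfa = probitraw*fa
    * write the result back to the dataset as variables named after the matrix columns
    svmat probitfa, names(col)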

    Now, running probit on the variables obtained this way, plus some extra dummy variables, runs into a problem.

    It shows:

    Note: 400 failures and 416 successes completely determined.

    My reading suggested that this most likely comes from collinearity in my data. Since I was already stuck on collinearity, I went back to an orthogonal rotation instead of an oblique rotation to minimize the correlation between the factors. At least, that was the plan.
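
    For reference, the switch I mean is run right after the EFA (my actual item list is omitted here):

    rotate, varimax      // orthogonal rotation: the rotated factors are uncorrelated by construction
    * rotate, promax     // the oblique rotation I used before, which allows correlated factors
    * estat common       // after an oblique rotation, this shows the correlations between the factors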

    For this reason I ran vif (see below)
    . vif, unc

        Variable |       VIF       1/VIF
    -------------+----------------------
         Factor3 |  9.12e+06    0.000000
         Factor2 |  5.80e+06    0.000000
         Factor5 |  2.73e+06    0.000000
         Factor4 | 892711.06    0.000001
         Factor6 | 662939.75    0.000002
         Factor1 | 239988.48    0.000004
             usa |      2.04    0.491378
    interconti~l |      1.93    0.516836
    market_based |      1.53    0.654417
      bank_based |      1.23    0.810417
       bank_type |      1.06    0.944274
        outliers |      1.04    0.959722
    eastern_eu~e |      1.01    0.993844
    -------------+----------------------
        Mean VIF |  1.50e+06
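
    For completeness, the correlation between the factor scores themselves can be checked directly (variable names as in the output above):

    correlate Factor1-Factor6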

    Now my question is: does it even make sense to run a probit model on factors found via an EFA?
    I found an old post where Mr Clyde Schechter and Mr Richard Williams discussed collinearity not being an issue, but my VIFs are immensely high.

    On the one hand, I am worried that interpreting margins is not very sensible with such high correlation between variables.
    On the other hand, the whole point of my thesis was to show how a multitude of variables can easily be summarized into very few factors, which in turn show the likelihood of a firm being a buyer or a target in an acquisition scenario.

    Additionally, are there other ways to work around this problem?
    Could implementing interaction terms be helpful, or would they just cover up a deeper problem?


    Thank you in advance

    Best,

    Aaron

    P.S. I am not sure what other info you might need; please do not hesitate to ask.

  • #2
    You didn't get a quick answer. You'll increase your chances of a helpful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex. We don't even know what kind of model you're estimating.

    Personally, I am uncomfortable with analyses that do this, then do that, then do this, and finally come up with some variables. If you want EFA, why not do that on the original variables? You seem to be doing some things manually that Stata has built in - that is likely where the problem is. If you want factor scores, let Stata create them. It is extremely strange that any exploratory factor analysis would create such highly correlated factors.
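
    For example, a minimal sketch of the built-in route (variable list and number of factors are placeholders):

    factor item1-item50, pf factors(6)   // EFA on the original items
    rotate, varimax                      // or an oblique rotation, if that is what you want
    predict f1-f6                        // Stata computes the (regression-method) factor scores

    If the -factor- results are still in memory, -predict- should also score another dataset containing the same items, so the same scoring coefficients are applied to both subsets.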

    While, strictly speaking, there are problems with using EFA followed by a separate analysis instead of SEM/GSEM, it is commonly done. Probit or regression should make no difference, since the EFA scores are on the right-hand side (I assume).



    • #3
      Thank you for your answer!

      Maybe I understood SEM incorrectly.
      I am trying to answer the question of whether a European merger of banks would make sense. Because of this, I wanted to gather data on a large number of variables, group them into a few factors, and then use these in two probit models. I have data from banks that were buyers and banks that were targets, on which I ran separate PCAs and EFAs. Now, in a new dataset, I added the dummy variable "Buyer" (1 for buyer, 0 for target) and ran a probit model on the factors found beforehand, to get the likelihood of a bank being a buyer or a target, which answers my initial question.

      From my understanding, SEM helps with a confirmatory FA and basically provides a mechanism to prove that the EFA is correct.
      I sadly do not understand how I can answer my question that way, but I am very thankful for any hints.
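
      For instance, is something along these lines what you have in mind? A rough gsem sketch (made-up item names; one latent factor measured by a few items, feeding a probit equation for Buyer):

      gsem (Risk -> item1 item2 item3) (Buyer <- Risk, probit)
      * Risk is latent (a capitalised name not in the dataset); the Buyer equation uses a probit link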



      • #4
        My apologies, I did not give enough data or code:

        * tabulate the dependent variable to see the % of 0 (49.95%) and 1 (50.05%)
        tabulate Buyer

        *** TARGET ***

        * these were the items used in the target EFA of set 2
        global targetfa_v4 leverage_3 leverage_2 leverage_0 netloanratio_3 netloanratio_2 netloanratio_1 netloanratio_0 incpworker_3 incpworker_2 incpworker_1 debtliab_3 debtliab_2 debtliab_1 debtliab_0 costpworker_3 costpworker_2 costpworker_1 costpworker_0 shenl_1 shenl_0 tcr_2 tcr_1 tcr_0 noemploy_2 noemploy_1 noemploy_0

        * create a matrix out of the variables from the EFA_target with the name "probit_target"
        mkmat leverage_3 leverage_2 leverage_0 netloanratio_3 netloanratio_2 netloanratio_1 netloanratio_0 incpworker_3 incpworker_2 incpworker_1 debtliab_3 debtliab_2 debtliab_1 debtliab_0 costpworker_3 costpworker_2 costpworker_1 costpworker_0 shenl_1 shenl_0 tcr_2 tcr_1 tcr_0 noemploy_2 noemploy_1 noemploy_0, mat(probit_target) obs nchar(1)

        *matrix list probit_target

        * multiply the item matrix by tfa (the matrix from the target EFA, created earlier and not shown here) to obtain raw factor values
        matrix probit_tfa = probit_target*tfa

        svmat probit_tfa, names( col )

        * rename factors
        rename Factor1 impactpwork_tar
        rename Factor2 risk_tar
        rename Factor3 nlratios_tar
        rename Factor4 debtliab_tar
        rename Factor5 noemp_tar


        * scaling down the factors for easier interpretation (will standardize in future (?) --> research )
        gen impactpwork_tar_scale = impactpwork_tar /1000000
        gen risk_tar_scale = risk_tar /1000000
        gen nlratios_tar_scale = nlratios_tar /1000000
        gen debtliab_tar_scale = debtliab_tar /1000000
        gen noemp_tar_scale = noemp_tar /1000000

        global probit_efa_tar_scale impactpwork_tar_scale risk_tar_scale nlratios_tar_scale debtliab_tar_scale noemp_tar_scale

        global probit_tar_scale $probit_efa_tar_scale usa bank_based market_based eastern_europe outliers intercontinental bank_type

        summarize $probit_tar_scale
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+--------------------------------------------------------
        impa~r_scale |        911   -.0002679    .0014973  -.0176181   .0030146
        risk_tar_s~e |        911    .0003944    .0015326  -.0001637   .0194293
        nlratios_t~e |        911    .0001926    .0007915   -.001138   .0102846
        debtliab_t~e |        911   -.0005534    .0021468  -.0262835   .0057064
        noemp_tar_~e |        911    .0062253    .0237269  -.0000114   .2822747
                 usa |        911    .9077936     .289476          0          1
          bank_based |        911    .0384193    .1923119          0          1
        market_based |        911    .0153677     .123078          0          1
        eastern_eu~e |        911    .0131723     .114075          0          1
            outliers |        911    .0054885    .0739212          0          1
        interconti~l |        911     .467618    .4992244          0          1
           bank_type |        911    .0329308    .1785536          0          1

        probit Buyer $probit_tar_scale, iter(50)
        Iteration 0: log likelihood = -631.45653
        Iteration 1: log likelihood = -57.526732
        Iteration 2: log likelihood = -45.178593
        Iteration 3: log likelihood = -42.434548
        Iteration 4: log likelihood = -42.047746
        Iteration 5: log likelihood = -41.991733
        Iteration 6: log likelihood = -41.981123
        Iteration 7: log likelihood = -41.979161
        Iteration 8: log likelihood = -41.978875
        Iteration 9: log likelihood = -41.978815
        Iteration 10: log likelihood = -41.978801
        Iteration 11: log likelihood = -41.978798
        Note: 397 failures and 416 successes completely determined.
        margins, dydx(*) atmeans
                              |            Delta-method
                              |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ----------------------+---------------------------------------------------------------
        impactpwork_tar_scale |   94.63399   5744.818     0.02   0.987       -11165   11354.27
               risk_tar_scale |   5570.629     338067     0.02   0.987    -657028.5   668169.8
           nlratios_tar_scale |  -6264.442   380172.2    -0.02   0.987    -751388.3   738859.4
           debtliab_tar_scale |   937.9525   56922.25     0.02   0.987    -110627.6   112503.5
              noemp_tar_scale |  -63.15537   3832.889    -0.02   0.987     -7575.48   7449.169
                          usa |  -2.183004   318.9329    -0.01   0.995    -627.2801   622.9141
                   bank_based |  -4.484456   473.1112    -0.01   0.992    -931.7654   922.7965
                 market_based |  -5.143864   460.9011    -0.01   0.991    -908.4935   898.2058
               eastern_europe |  -4.551445   471.7304    -0.01   0.992     -929.126   920.0231
                     outliers |  -4.932443   464.4756    -0.01   0.992    -915.2879    905.423
             intercontinental |  -4.905878   464.9477    -0.01   0.992    -916.1867   906.3749
                    bank_type |   .2455987   14.90663     0.02   0.987    -28.97087   29.46206
        margins, dydx(*)
                              |            Delta-method
                              |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ----------------------+---------------------------------------------------------------
        impactpwork_tar_scale |   6.252062   9.940346     0.63   0.529    -13.23066   25.73478
               risk_tar_scale |   368.0276   212.2795     1.73   0.083    -48.03255   784.0877
           nlratios_tar_scale |  -413.8648   235.4274    -1.76   0.079    -875.2941   47.56444
           debtliab_tar_scale |    61.9665   38.39995     1.61   0.107    -13.29602    137.229
              noemp_tar_scale |  -4.172404   3.291014    -1.27   0.205    -10.62267   2.277866
                          usa |  -.1442217     16.103    -0.01   0.993    -31.70551   31.41707
                   bank_based |  -.2962687   41.06387    -0.01   0.994    -80.77998   80.18744
                 market_based |   -.339833   41.06385    -0.01   0.993    -80.82349   80.14383
               eastern_europe |  -.3006945   41.06387    -0.01   0.994     -80.7844   80.18301
                     outliers |  -.3258654   41.06386    -0.01   0.994    -80.80955   80.15781
             intercontinental |  -.3241103   41.06385    -0.01   0.994    -80.80778   80.15956
                    bank_type |   .0162257   .0178031     0.91   0.362    -.0186678   .0511191
        * for evaluating the goodness of fit (gof)
        fitstat

        Log-Lik Intercept Only:      -631.457    Log-Lik Full Model:        -41.979
        D(898):                        83.958    LR(12):                   1178.955
                                                 Prob > LR:                   0.000
        McFadden's R2:                  0.934    McFadden's Adj R2:           0.913
        Maximum Likelihood R2:          0.726    Cragg & Uhler's R2:          0.968
        McKelvey and Zavoina's R2:      0.973    Efron's R2:                  0.940
        Variance of y*:                37.480    Variance of error:           1.000
        Count R2:                       0.978    Adj Count R2:                0.956
        AIC:                            0.121    AIC*n:                     109.958
        BIC:                        -6035.502    BIC':                    -1097.181
        * prediction of Buyer
        quietly probit Buyer $probit_tar_scale
        predict pprobit_tar, pr
        summarize Buyer pprobit_tar
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+--------------------------------------------------------
               Buyer |        911    .5005488    .5002743          0          1
         pprobit_tar |        911    .5008816    .4844373   1.58e-10          1
        * % correctly predicted values
        quietly probit Buyer $probit_tar_scale
        estat classification
        Sensitivity                     Pr( +| D)   98.90%
        Specificity                     Pr( -|~D)   96.70%
        Positive predictive value       Pr( D| +)   96.78%
        Negative predictive value       Pr(~D| -)   98.88%
        False + rate for true ~D        Pr( +|~D)    3.30%
        False - rate for true D         Pr( -| D)    1.10%
        False + rate for classified +   Pr(~D| +)    3.22%
        False - rate for classified -   Pr( D| -)    1.12%
        Correctly classified                         97.80%
        vif, unc
            Variable |       VIF       1/VIF
        -------------+----------------------
        risk_tar_s~e |   1401.68    0.000713
        nlratios_t~e |    454.44    0.002200
        noemp_tar_~e |    391.94    0.002551
        debtliab_t~e |     56.21    0.017791
        impa~r_scale |     50.11    0.019958
                 usa |      2.11    0.473449
        interconti~l |      2.00    0.499361
        market_based |      1.28    0.778592
          bank_based |      1.23    0.811194
           bank_type |      1.06    0.940280
        eastern_eu~e |      1.01    0.994878
            outliers |      1.00    0.997546
        -------------+----------------------
            Mean VIF |    197.01
        Now the following points bother me personally:
        1. The probit note: "397 failures and 416 successes completely determined."
        2. fitstat: Variance of error: 1.000
        3. Obviously the VIFs are incredibly high, as stated before.
        4. Is it helpful to standardize the variables? I assume it would help with interpreting margins (see the sketch after this list).
        5. An additional part of my thesis concerned different parts of Europe, which I captured via dummy variables. Sadly these are insignificant, and I don't really know how to work around this.
          1. The option "robust" helped a lot here; I assume that extreme outliers in various regions made the dummies insignificant.
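
        If standardizing is the way to go, I would try something like the following instead of the /1000000 scaling (-egen, std()- gives mean 0 and standard deviation 1; the new global name is just a placeholder):

        foreach v in impactpwork_tar risk_tar nlratios_tar debtliab_tar noemp_tar {
            egen `v'_std = std(`v')
        }
        global probit_efa_tar_std impactpwork_tar_std risk_tar_std nlratios_tar_std debtliab_tar_std noemp_tar_std
        probit Buyer $probit_efa_tar_std usa bank_based market_based eastern_europe outliers intercontinental bank_type, iter(50)
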
        I ran similar code 3 times: once for "target" and twice for "buyer". For "buyer", I did the factor analysis once with principal factors and once with maximum likelihood. Since the results were somewhat different, with ML summarizing the data even more, I wanted to include both in different probit models.
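
        For reference, the two extraction calls look like this (the global holding the buyer items and the number of factors are placeholders):

        factor $buyerfa_items, pf factors(6)    // principal-factor extraction
        factor $buyerfa_items, ml factors(6)    // maximum-likelihood extraction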

        I wanted to hold back my questions on those regressions, since I hope I will understand a lot myself once I understand the "target" regression better.


        Thank you in advance,

        Please tell me if you need any more information.


        Aaron
        Last edited by Aaron Nagel; 06 Oct 2019, 16:00.



        • #5
          I found a mistake in a variable of mine, which pushed my R2 down to 3%.

          Thank you for your concern, but it could take some time until I am back at the level I was.

