  • Probit model, collinearity from factor analysis data

    Hello everybody,

    I used the results of a PCA as the basis for an EFA and now want to use my findings (6 factors) in a probit model.

    What I did so far and what is planned:
    1. I split my dataset into 4 sets.
    2. I used PCA to trim down my ~180 variables (I now have 8 components describing the items).
    3. I did not use the 8 components themselves; instead, I took the trimmed-down items behind the components (around 50) and ran an EFA on set 2.
    4. After finding the underlying structure of 6 factors, I want to apply these findings to a third set in order to set up a probit model.
    5. The last of the 4 sets is for running the probit model.
    Now:
    I am somewhat stuck on how to implement the probit model.
    I used "predict fa1 fa2 ... fa11" to obtain new variables for the factors I found via the EFA. Then, through "mkmat ..., mat(probitraw) obs nchar(1)", "mat probitfa = probitraw*fa", and "svmat probitfa, names(col)", I transferred the structure / factors onto my new set 3.
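
    Concretely, the manual step looked roughly like this (item names are placeholders; fa is the loading matrix I saved from the EFA on set 2):

    * save the rotated loadings after -factor- and -rotate- on set 2
    matrix fa = e(r_L)
    * build a data matrix from the same items in set 3
    mkmat item1 item2 item3, mat(probitraw) obs nchar(1)
    * multiply the observations by the loadings to get raw factor values for set 3
    matrix probitfa = probitraw*fa
    * write the result back to the dataset as variables named after the matrix columns
    svmat probitfa, names(col)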

    Now, running probit on the variables obtained this way, plus some extra dummy variables, runs into a problem.

    It shows:

    Note: 400 failures and 416 successes completely determined.

    My reading suggested that this most likely comes from collinearity in my data. Since I was already stuck on collinearity, I went back to an orthogonal rotation instead of an oblique rotation to minimize the correlation between the factors. At least, that was the plan.
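
    For reference, the switch I mean is run right after the EFA (my actual item list is omitted here):

    rotate, varimax      // orthogonal rotation: the rotated factors are uncorrelated by construction
    * rotate, promax     // the oblique rotation I used before, which allows correlated factors
    * estat common       // after an oblique rotation, this shows the correlations between the factors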

    For this reason I ran vif (see below)
    . vif, unc

        Variable |       VIF       1/VIF
    -------------+----------------------
         Factor3 |  9.12e+06    0.000000
         Factor2 |  5.80e+06    0.000000
         Factor5 |  2.73e+06    0.000000
         Factor4 | 892711.06    0.000001
         Factor6 | 662939.75    0.000002
         Factor1 | 239988.48    0.000004
             usa |      2.04    0.491378
    interconti~l |      1.93    0.516836
    market_based |      1.53    0.654417
      bank_based |      1.23    0.810417
       bank_type |      1.06    0.944274
        outliers |      1.04    0.959722
    eastern_eu~e |      1.01    0.993844
    -------------+----------------------
        Mean VIF |  1.50e+06
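
    For completeness, the correlation between the factor scores themselves can be checked directly (variable names as in the output above):

    correlate Factor1-Factor6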

    Now my question is: does it even make sense to run a probit model on factors found via an EFA?
    I found an old post where Mr Clyde Schechter and Mr Richard Williams discussed collinearity not being an issue, but my VIFs are immensely high.

    On the one hand, I am worried that interpreting margins is not very sensible with such high correlation between variables.
    On the other hand, the whole point of my thesis was to show how a multitude of variables can easily be summarized into very few factors, which in turn show the likelihood of a firm being a buyer or a target in an acquisition scenario.

    Additionally, are there other ways to work around this problem?
    Could implementing interaction terms be helpful, or would they just cover up a deeper problem?


    Thank you in advance

    Best,

    Aaron

    P.S. I am not sure what other info you might need; please do not hesitate to ask.

  • #2
    You didn't get a quick answer. You'll increase your chances of a helpful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex. We don't even know what kind of model you're estimating.

    Personally, I am uncomfortable with analyses that do this, then do that, then do this, and finally come up with some variables. If you want EFA, why not do that on the original variables? You seem to be doing some things manually that Stata has built in - that is likely where the problem is. If you want factor scores, let Stata create them. It is extremely strange that any exploratory factor analysis would create such highly correlated factors.
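
    For example, a minimal sketch of the built-in route (variable list and number of factors are placeholders):

    factor item1-item50, pf factors(6)   // EFA on the original items
    rotate, varimax                      // or an oblique rotation, if that is what you want
    predict f1-f6                        // Stata computes the (regression-method) factor scores

    If the -factor- results are still in memory, -predict- should also score another dataset containing the same items, so the same scoring coefficients are applied to both subsets.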

    While, strictly speaking, there are problems with using EFA followed by a separate analysis instead of SEM/GSEM, it is commonly done. Probit or regression should make no difference, since the EFA scores are on the right-hand side (I assume).



    • #3
      Thank you for your answer!

      Maybe I understood SEM incorrectly.
      I am trying to answer the question of whether a European merger of banks would make sense. Because of this, I wanted to gather data on a large number of variables, group them into a few factors, and then use these in two probit models. I have data from banks that were buyers and banks that were targets, on which I ran separate PCAs and EFAs. Now, in a new dataset, I added the dummy variable "Buyer" (1 for buyer, 0 for target) and ran a probit model on the factors found beforehand, to get the likelihood of a bank being a buyer or a target, which answers my initial question.

      From my understanding, SEM helps with a confirmatory FA and basically provides a mechanism to prove that the EFA is correct.
      I sadly do not understand how I can answer my question that way, but I am very thankful for any hints.
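
      For instance, is something along these lines what you have in mind? A rough gsem sketch (made-up item names; one latent factor measured by a few items, feeding a probit equation for Buyer):

      gsem (Risk -> item1 item2 item3) (Buyer <- Risk, probit)
      * Risk is latent (a capitalised name not in the dataset); the Buyer equation uses a probit link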



      • #4
        My apologies, I did not give enough data or code:

        * tabulate the dependent variable to see the % of 0 (49.95%) and 1 (50.05%)
        tabulate Buyer

        *** TARGET ***

        * these were the items used in the target EFA of set 2
        global targetfa_v4 leverage_3 leverage_2 leverage_0 netloanratio_3 netloanratio_2 netloanratio_1 netloanratio_0 incpworker_3 incpworker_2 incpworker_1 debtliab_3 debtliab_2 debtliab_1 debtliab_0 costpworker_3 costpworker_2 costpworker_1 costpworker_0 shenl_1 shenl_0 tcr_2 tcr_1 tcr_0 noemploy_2 noemploy_1 noemploy_0

        * create a matrix out of the variables from the EFA_target with the name "probit_target"
        mkmat leverage_3 leverage_2 leverage_0 netloanratio_3 netloanratio_2 netloanratio_1 netloanratio_0 incpworker_3 incpworker_2 incpworker_1 debtliab_3 debtliab_2 debtliab_1 debtliab_0 costpworker_3 costpworker_2 costpworker_1 costpworker_0 shenl_1 shenl_0 tcr_2 tcr_1 tcr_0 noemploy_2 noemploy_1 noemploy_0, mat(probit_target) obs nchar(1)

        *matrix list probit_target

        * multiply the item matrix by tfa (the matrix from the target EFA, created earlier and not shown here) to obtain raw factor values
        matrix probit_tfa = probit_target*tfa

        svmat probit_tfa, names( col )

        * rename factors
        rename Factor1 impactpwork_tar
        rename Factor2 risk_tar
        rename Factor3 nlratios_tar
        rename Factor4 debtliab_tar
        rename Factor5 noemp_tar


        * scaling down the factors for easier interpretation (will standardize in future (?) --> research )
        gen impactpwork_tar_scale = impactpwork_tar /1000000
        gen risk_tar_scale = risk_tar /1000000
        gen nlratios_tar_scale = nlratios_tar /1000000
        gen debtliab_tar_scale = debtliab_tar /1000000
        gen noemp_tar_scale = noemp_tar /1000000

        global probit_efa_tar_scale impactpwork_tar_scale risk_tar_scale nlratios_tar_scale debtliab_tar_scale noemp_tar_scale

        global probit_tar_scale $probit_efa_tar_scale usa bank_based market_based eastern_europe outliers intercontinental bank_type

        summarize $probit_tar_scale
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+--------------------------------------------------------
        impa~r_scale |        911   -.0002679    .0014973  -.0176181   .0030146
        risk_tar_s~e |        911    .0003944    .0015326  -.0001637   .0194293
        nlratios_t~e |        911    .0001926    .0007915   -.001138   .0102846
        debtliab_t~e |        911   -.0005534    .0021468  -.0262835   .0057064
        noemp_tar_~e |        911    .0062253    .0237269  -.0000114   .2822747
                 usa |        911    .9077936     .289476          0          1
          bank_based |        911    .0384193    .1923119          0          1
        market_based |        911    .0153677     .123078          0          1
        eastern_eu~e |        911    .0131723     .114075          0          1
            outliers |        911    .0054885    .0739212          0          1
        interconti~l |        911     .467618    .4992244          0          1
           bank_type |        911    .0329308    .1785536          0          1

        probit Buyer $probit_tar_scale, iter(50)
        Iteration 0: log likelihood = -631.45653
        Iteration 1: log likelihood = -57.526732
        Iteration 2: log likelihood = -45.178593
        Iteration 3: log likelihood = -42.434548
        Iteration 4: log likelihood = -42.047746
        Iteration 5: log likelihood = -41.991733
        Iteration 6: log likelihood = -41.981123
        Iteration 7: log likelihood = -41.979161
        Iteration 8: log likelihood = -41.978875
        Iteration 9: log likelihood = -41.978815
        Iteration 10: log likelihood = -41.978801
        Iteration 11: log likelihood = -41.978798
        Note: 397 failures and 416 successes completely determined.
        margins, dydx(*) atmeans
                              |            Delta-method
                              |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ----------------------+---------------------------------------------------------------
        impactpwork_tar_scale |   94.63399   5744.818     0.02   0.987       -11165   11354.27
               risk_tar_scale |   5570.629     338067     0.02   0.987    -657028.5   668169.8
           nlratios_tar_scale |  -6264.442   380172.2    -0.02   0.987    -751388.3   738859.4
           debtliab_tar_scale |   937.9525   56922.25     0.02   0.987    -110627.6   112503.5
              noemp_tar_scale |  -63.15537   3832.889    -0.02   0.987     -7575.48   7449.169
                          usa |  -2.183004   318.9329    -0.01   0.995    -627.2801   622.9141
                   bank_based |  -4.484456   473.1112    -0.01   0.992    -931.7654   922.7965
                 market_based |  -5.143864   460.9011    -0.01   0.991    -908.4935   898.2058
               eastern_europe |  -4.551445   471.7304    -0.01   0.992     -929.126   920.0231
                     outliers |  -4.932443   464.4756    -0.01   0.992    -915.2879    905.423
             intercontinental |  -4.905878   464.9477    -0.01   0.992    -916.1867   906.3749
                    bank_type |   .2455987   14.90663     0.02   0.987    -28.97087   29.46206
        margins, dydx(*)
                              |            Delta-method
                              |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ----------------------+---------------------------------------------------------------
        impactpwork_tar_scale |   6.252062   9.940346     0.63   0.529    -13.23066   25.73478
               risk_tar_scale |   368.0276   212.2795     1.73   0.083    -48.03255   784.0877
           nlratios_tar_scale |  -413.8648   235.4274    -1.76   0.079    -875.2941   47.56444
           debtliab_tar_scale |    61.9665   38.39995     1.61   0.107    -13.29602    137.229
              noemp_tar_scale |  -4.172404   3.291014    -1.27   0.205    -10.62267   2.277866
                          usa |  -.1442217     16.103    -0.01   0.993    -31.70551   31.41707
                   bank_based |  -.2962687   41.06387    -0.01   0.994    -80.77998   80.18744
                 market_based |   -.339833   41.06385    -0.01   0.993    -80.82349   80.14383
               eastern_europe |  -.3006945   41.06387    -0.01   0.994     -80.7844   80.18301
                     outliers |  -.3258654   41.06386    -0.01   0.994    -80.80955   80.15781
             intercontinental |  -.3241103   41.06385    -0.01   0.994    -80.80778   80.15956
                    bank_type |   .0162257   .0178031     0.91   0.362    -.0186678   .0511191
        * for evaluating the goodness of fit (gof)
        fitstat

        Log-Lik Intercept Only:      -631.457    Log-Lik Full Model:        -41.979
        D(898):                        83.958    LR(12):                   1178.955
                                                 Prob > LR:                   0.000
        McFadden's R2:                  0.934    McFadden's Adj R2:           0.913
        Maximum Likelihood R2:          0.726    Cragg & Uhler's R2:          0.968
        McKelvey and Zavoina's R2:      0.973    Efron's R2:                  0.940
        Variance of y*:                37.480    Variance of error:           1.000
        Count R2:                       0.978    Adj Count R2:                0.956
        AIC:                            0.121    AIC*n:                     109.958
        BIC:                        -6035.502    BIC':                    -1097.181
        * prediction of Buyer
        quietly probit Buyer $probit_tar_scale
        predict pprobit_tar, pr
        summarize Buyer pprobit_tar
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+--------------------------------------------------------
               Buyer |        911    .5005488    .5002743          0          1
         pprobit_tar |        911    .5008816    .4844373   1.58e-10          1
        * % correctly predicted values
        quietly probit Buyer $probit_tar_scale
        estat classification
        Sensitivity                     Pr( +| D)   98.90%
        Specificity                     Pr( -|~D)   96.70%
        Positive predictive value       Pr( D| +)   96.78%
        Negative predictive value       Pr(~D| -)   98.88%
        False + rate for true ~D        Pr( +|~D)    3.30%
        False - rate for true D         Pr( -| D)    1.10%
        False + rate for classified +   Pr(~D| +)    3.22%
        False - rate for classified -   Pr( D| -)    1.12%
        Correctly classified                         97.80%
        vif, unc
            Variable |       VIF       1/VIF
        -------------+----------------------
        risk_tar_s~e |   1401.68    0.000713
        nlratios_t~e |    454.44    0.002200
        noemp_tar_~e |    391.94    0.002551
        debtliab_t~e |     56.21    0.017791
        impa~r_scale |     50.11    0.019958
                 usa |      2.11    0.473449
        interconti~l |      2.00    0.499361
        market_based |      1.28    0.778592
          bank_based |      1.23    0.811194
           bank_type |      1.06    0.940280
        eastern_eu~e |      1.01    0.994878
            outliers |      1.00    0.997546
        -------------+----------------------
            Mean VIF |    197.01
        Now the following points bother me personally:
        1. The probit note: "397 failures and 416 successes completely determined."
        2. fitstat: Variance of error: 1.000
        3. Obviously the VIFs are incredibly high, as stated before.
        4. Is it helpful to standardize the variables? I assume it would help with interpreting margins (see the sketch after this list).
        5. An additional part of my thesis concerned different parts of Europe, which I captured via dummy variables. Sadly these are insignificant, and I don't really know how to work around this.
          1. The option "robust" helped a lot here; I assume that extreme outliers in various regions made the dummies insignificant.
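
        If standardizing is the way to go, I would try something like the following instead of the /1000000 scaling (-egen, std()- gives mean 0 and standard deviation 1; the new global name is just a placeholder):

        foreach v in impactpwork_tar risk_tar nlratios_tar debtliab_tar noemp_tar {
            egen `v'_std = std(`v')
        }
        global probit_efa_tar_std impactpwork_tar_std risk_tar_std nlratios_tar_std debtliab_tar_std noemp_tar_std
        probit Buyer $probit_efa_tar_std usa bank_based market_based eastern_europe outliers intercontinental bank_type, iter(50)
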
        I ran similar code 3 times: once for "target" and twice for "buyer". For "buyer", I did the factor analysis once with principal factors and once with maximum likelihood. Since the results were somewhat different, with ML summarizing the data even more, I wanted to include both in different probit models.
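
        For reference, the two extraction calls look like this (the global holding the buyer items and the number of factors are placeholders):

        factor $buyerfa_items, pf factors(6)    // principal-factor extraction
        factor $buyerfa_items, ml factors(6)    // maximum-likelihood extraction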

        I wanted to hold back my questions on those regressions, since I hope I will understand a lot myself once I understand the "target" regression better.


        Thank you in advance,

        Please tell me if you need any more information.


        Aaron
        Last edited by Aaron Nagel; 06 Oct 2019, 16:00.



        • #5
          I found a mistake in a variable of mine, which pushed my R2 down to 3%.

          Thank you for your concern, but it could take some time until I am back at the level I was.

