
Code:

duplicates tag sales assets year, generate(dups)

If I sort by sales (or assets) and list the data, I can clearly see there are many pairs of firms that have the same values in the same year. I get something like this:

Company ID   Sales   Assets   Year   dups
         1       6       25   1999      1
         2       6       25   1999      1
         3      10      100   1999      1
         4      10      100   1999      1
         1     3.5       45   2000      1
         2     3.5       45   2000      1
         3       1       50   2000      1
         4       1       50   2000      1

To correct for double counting, I'd like to keep one firm and drop the other. I've seen code for creating an alternating dummy value each year, which would work in the fictional list I've provided. In my actual (more complicated) data, this would not work: I would end up dropping some years for Company A and some years for Company B, when it would be better to drop just one or the other throughout. That is, I need to create a dummy variable where 0 marks one of the firms with duplicate data and 1 marks the other. Or, in line with the fictional data I have, I want to tell Stata that firm 1 has the same values as firm 2 in all years, and that firm 3 has the same values as firm 4 in all years too.
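A minimal sketch of one possible approach, using the variable names from the example above (company, sales, assets, year): within each group of observations sharing the same (sales, assets, year), keep only the lowest company ID, so the same firm of each pair survives in every year.

```stata
* Tag groups of observations with identical sales, assets and year
duplicates tag sales assets year, generate(dups)

* Within each duplicate group, sort by company ID and drop all but the
* first (lowest-ID) observation; non-duplicates (dups == 0) are untouched
bysort sales assets year (company): drop if dups > 0 & _n > 1
```

This assumes the same pair of firms is duplicated in every year; if the pairings change from year to year, you would first want to tabulate which firm IDs co-occur in duplicate groups before deciding which to keep.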

I hope this is clear. Any help anyone has is appreciated.

]]>

When I run FE (with robust standard errors) and OLS on my data (analysing the effect on real office rents), I get essentially the same results. I was under the impression that the model should break down with OLS, as the data are a panel. How can I explain this intuitively?

Code:

xtreg log_rentgrowth weworkspace L.weworkspace L2.weworkspace mo3_officeemploychg L.log_vacratechg log_wage_allindustries_chg demandchg netcompletions_chg, fe robust

Code:

regress log_rentgrowth weworkspace L.weworkspace L2.weworkspace mo3_officeemploychg L.log_vacratechg log_wage_allindustries_chg demandchg netcompletions_chg, robust

where weworkspace marks when a WeWork office opens (the effect is measured by its square footage).

Note that above I have adjusted the variable names to make them easier to understand on this forum.

Code:

Fixed-effects (within) regression               Number of obs      =       700
Group variable: DISTRICT                        Number of groups   =        20

R-sq:  within  = 0.2576                         Obs per group: min =        35
       between = 0.0090                                        avg =      35.0
       overall = 0.2538                                        max =        35

                                                F(8,19)            =    994.07
corr(u_i, Xb)  = -0.0046                        Prob > F           =    0.0000

                              (Std. Err. adjusted for 20 clusters in DISTRICT)
----------------------------------------------------------------------------------------
                       |               Robust
             log_ratio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
  weworkspacediv100000 |
                   --. |   .0069729   .0028635     2.44   0.025     .0009795    .0129663
                   L1. |   .0027897   .0017442     1.60   0.126    -.0008611    .0064404
                   L2. |   .0063592   .0030579     2.08   0.051     -.000041    .0127594
                       |
        mo3_proemplchg |   .7043324   .0348203    20.23   0.000     .6314527    .7772122
                       |
           log_vacrate |
                   L1. |  -.0115143   .0079909    -1.44   0.166    -.0282394    .0052108
                       |
log_wage_allindustries |    .024448    .007125     3.43   0.003     .0095353    .0393607
             demandchg |    .179892   .1012628     1.78   0.092    -.0320535    .3918375
    netcompletions_chg |  -.2701654   .1072092    -2.52   0.021    -.4945567   -.0457741
                 _cons |   .0006551   .0002466     2.66   0.016     .0001391    .0011712
-----------------------+----------------------------------------------------------------
               sigma_u |   .0030584
               sigma_e |  .02169764
                   rho |  .01948141   (fraction of variance due to u_i)
----------------------------------------------------------------------------------------

Code:

Linear regression                               Number of obs     =        700
                                                F(8, 691)         =      25.33
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2539
                                                Root MSE          =     .02161

----------------------------------------------------------------------------------------
                       |               Robust
             log_ratio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
  weworkspacediv100000 |
                   --. |   .0065003   .0029224     2.22   0.026     .0007624    .0122382
                   L1. |   .0023713   .0023824     1.00   0.320    -.0023063    .0070488
                   L2. |   .0057454   .0028047     2.05   0.041     .0002386    .0112522
                       |
        mo3_proemplchg |   .7042677   .0530478    13.28   0.000     .6001134    .8084219
                       |
           log_vacrate |
                   L1. |   -.011783   .0044797    -2.63   0.009    -.0205786   -.0029875
                       |
log_wage_allindustries |   .0244981   .0188294     1.30   0.194    -.0124715    .0614678
             demandchg |   .1870332   .0835134     2.24   0.025     .0230628    .3510036
    netcompletions_chg |  -.2834789   .1595353    -1.78   0.076     -.596711    .0297531
                 _cons |    .000725   .0009704     0.75   0.455    -.0011803    .0026303
----------------------------------------------------------------------------------------

]]>

Also, how do I test for serial correlation and heteroskedasticity with panel data in Stata? Different sources tell me different things. How would I then resolve these issues? Will it affect whether I can use, e.g., fixed effects?
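Two checks that are often suggested for this setting (a sketch, not a definitive prescription; xtserial and xttest3 are user-written commands, and the variable list here is abbreviated from the models above):

```stata
* Wooldridge test for serial correlation in panel data (Drukker, 2003)
ssc install xtserial
xtserial log_rentgrowth weworkspace mo3_officeemploychg

* Modified Wald test for groupwise heteroskedasticity after -xtreg, fe-
ssc install xttest3
xtreg log_rentgrowth weworkspace mo3_officeemploychg, fe
xttest3
```

If either test rejects, cluster-robust standard errors (as already used in the commands above, clustered on DISTRICT) are a common remedy and do not prevent the use of fixed effects.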

Any help would be greatly appreciated. ]]>

My dataset consists of individual-level data on university graduates' grades (a categorical variable, i.e. first, upper second, lower second, third), along with various information on their background characteristics, i.e. social class, gender, ethnicity.

I am looking to make a graph to describe my data. In particular I would like to create a bar chart telling me the proportion of students in each social class achieving the different categories of degree class.
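One possible way to build such a chart (a sketch; degclass and socclass are hypothetical names for the degree-class and social-class variables):

```stata
* Collapse to one row per (social class, degree class) cell, compute
* the within-class proportion, then plot
preserve
contract socclass degclass
bysort socclass: egen total = total(_freq)
gen prop = _freq / total
graph bar prop, over(degclass) over(socclass) ytitle("Proportion")
restore
```

The user-written catplot (ssc install catplot) offers a more direct route to the same kind of chart.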

If anyone knows how to do this please let me know.

Thank you. ]]>

From my thesis advisor I got the feedback: Try to analyze the data of cash compensation and show some trend in time, in industry.

Therefore, I thought I would create groups and then show summary statistics per group.

Code:

tabstat Severancepayment, statistics(mean median) by(sub_sectors)

Code:

too many variables specified

Code:

tostring StandardIndustryClassification, gen(sicIndustry) format(%04.0f)
gen sic = substr(sicIndustry, -4, 2)
egen sub_sectors = group(sic)
sort sub_sectors
gen Agricultural_Forestry_Fishing = (sub_sectors == 1)
gen Mining = (sub_sectors == 2 | sub_sectors == 3 | sub_sectors == 21)
gen Manufacturing = (sub_sectors >= 4 & sub_sectors <= 20)
gen Wholesale_trade = (sub_sectors == 22 | sub_sectors == 23)
gen Retail_trade = (sub_sectors >= 24 & sub_sectors <= 27)
gen Finance_Insurance_Real_estate = (sub_sectors == 28)
gen Services = (sub_sectors >= 29 & sub_sectors <= 35)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int YearMatch str4 Year int StandardIndustryClassification double(Salary Bonus Severancepayment)
2000 "2000" 2834  765.769  775     4622.307
2004 "2004" 7389  433.138  240.903 0
2003 "2003" 3661  451.731  495.789 2842.56
1998 "1998" 2834  395      230     0
2002 "2002" 3663  517      114     0
1997 "1997" 5411  845.833  418.281 4300
2000 "2000" 7990  450      450     0
2006 "2006" 3357  633.1    392.522 4273.425
2003 "2003" 3821  776.475  0       0
1997 "1997" 5912  450      0       1350
2003 "2003" 7990  625      500.337 1875
1998 "1998" 3312  594.166  0       1782.498
1999 "1999" 3714  708.173  665.683 5152.095
1998 "1998" 3576  525.732  525     1576.098
1998 "1998" 2300 1044.336  0       0
2000 "2000" 7372 1200     2800     30600
1996 "1996" 7373  400      0       0
1997 "1997" 3841  233.333  45.678  300
2000 "2000" 5122  675      245.9   2753.491
1997 "1997" 2890  574.997  245     2459.991
2002 "2002" 2836  810      800     0
2006 "2006" 2835  574.304  36.143  1156.2
1997 "1997" 7372  375.394  0       400
1999 "1999" 3674  350      639.396 3850
2006 "2006" 3663  415.35   0       1246.0500000000002
2003 "2003" 3357  600      60.885  2116.327
2003 "2003" 7990  998.462 1225     8995.386
2005 "2005" 7372  445.747  0       333.7
1999 "1999" 3661  340.018  103.972 841.52
1998 "1998" 2836  496.923  354     0
2006 "2006" 7389  520      589.16  0
2001 "2001" 2860  512.629  202.2   2397.815
2002 "2002" 2451  330     1670     0
1998 "1998" 3842  309.04   138.376 894.8320000000001
2000 "2000" 3571 1234.641 3806.33  0
2004 "2004" 7372  380.015  0       619.4
2000 "2000" 2911 1200     3000     0
1999 "1999" 3320  735     1029     12900
2004 "2004" 3569  494.766  800.8   4159.925
1998 "1998" 1000 1133.524  975     8029.445
2006 "2006" 3826  575      0       4125
1998 "1998" 3823  250      157.5   1222.5
1998 "1998" 3570  750      0       32250
2000 "2000" 2020  735      124.95  1719.9
2005 "2005"  100  348      420     1880
1999 "1999" 2834  692.5    539.63  2543
1997 "1997" 3570  901.939  100     6450
1997 "1997" 1382  841.667 1037     0
2006 "2006" 3812  650      0       4133.4
2000 "2000" 2050  675      113.4   2805.444
1997 "1997" 3714  283.33   300     1749.9899999999998
2001 "2001" 3674  295.385  76.475  0
2004 "2004" 5945  531.135  390     800
2003 "2003" 1311  325      400     2175
1996 "1996" 2211  531.599  0       1594.797
2005 "2005" 7372  546      301.938 0
2005 "2005" 5040 1100     1100     6600
1999 "1999" 2621  990      698.465 5065.395
1997 "1997" 3944  500      0       4312.5
2004 "2004" 3420 1462.5   3000     13387.5
1996 "1996" 3561  540      425.2   2895.6000000000004
1996 "1996" 6794 1501.93  2250     0
2001 "2001" 7372  422.917  167.301 2100
2000 "2000" 1311  375      461.825 2510.4750000000004
2002 "2002" 7363  350      0       1050
1998 "1998" 3730  553.334  282.15  2506.4519999999998
1997 "1997" 3350  470      322     2376
1998 "1998" 5411  437      233.402 2011.2060000000001
1999 "1999" 7372  256.833  86.94   1031.319
2004 "2004" 3560  375      75      0
1997 "1997" 7372  290      145     0
2006 "2006" 7372  535      660     535
2000 "2000" 2836  680.016 1445     5400
2003 "2003" 7370  863      450     5250
2004 "2004" 3674  334.517  247.031 0
2003 "2003" 2040  770      0       2810.5
2005 "2005" 7372  400      351.764 0
2005 "2005" 3531  665.016 1130.527 5386.629000000001
2005 "2005" 2780  750     1200     5850
2006 "2006" 3949  746.539  680     4036.5
2006 "2006" 5013  524.14   0       4192.71
1998 "1998" 7822 1300     4011.663 0
2006 "2006" 3695  434.002  0       868.004
2003 "2003" 7381  550      300     0
2004 "2004" 8071  325      325     0
2001 "2001" 5961  458.654  401.322 0
1998 "1998" 2821  332.495  0       0
1999 "1999" 3730  675.772  350     3077.316
2006 "2006" 3317  520.833  0       4593.518
2000 "2000" 1311  356.154  300     1312.308
1996 "1996" 1311  586.538  229.5   0
1996 "1996" 3312  603.056  0       2392.4840000000004
2005 "2005" 7372  350      360.279 0
2004 "2004" 7372  400      400     1600
2003 "2003" 7990 1250     1850     537.5
2006 "2006" 7372  450      282.15  1464.3
2000 "2000" 1381  400      517.48  0
1997 "1997" 3241  285      163.445 1345.335
1997 "1997" 5311  839.846  409.483 3747.987
1996 "1996" 1311  180.142  0       1600
end

Thank you in advance!

]]>

I am doing a fractional regression. I would like to do a robustness check with bootstrapped SE. So, I typed in the following command

Code:

fracreg logit fl.IV DV control1 control2 control3 i.year, vce(bootstrap, reps(5000))

Code:

time-series operators are not allowed with bootstrap without panels, see tsset

Code:

tsset year

Code:

repeated time values in sample

Code:

       year |      Freq.     Percent        Cum.
------------+-----------------------------------
       2008 |         31        9.81        9.81
       2009 |         37       11.71       21.52
       2010 |         40       12.66       34.18
       2011 |         45       14.24       48.42
       2012 |         37       11.71       60.13
       2013 |         39       13.92       74.05
       2014 |         41       12.97       87.03
       2015 |         30       12.97      100.00
------------+-----------------------------------
      Total |        305      100.00
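The "repeated time values in sample" error arises because year alone does not uniquely index the observations: with many units per year, the data are a panel, and both dimensions need to be declared before time-series operators will work. A sketch (firm_id is a hypothetical name for whatever variable identifies the panel units):

```stata
* Declare the panel structure so time-series operators like fl. work
xtset firm_id year
```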

Currently, I'm working on scientific research about the determinants of transfer pricing aggressiveness in Vietnam. I'm stuck handling some negative pre-tax income values for one of the determinants (I'm using log transformations). There are some methods (like adding a constant, or replacing the values with 1), but I don't know which method is best. Please help.
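One alternative that is often suggested when a log transform must handle zero or negative values (a sketch; pretax_income is a hypothetical variable name): the inverse hyperbolic sine transform, which is defined everywhere and behaves like a log for large positive values.

```stata
* asinh(x) = ln(x + sqrt(x^2 + 1)); defined for zero and negative values
gen ihs_income = asinh(pretax_income)
```

Like adding a constant, this changes the interpretation of the coefficient, so whichever choice is made should be reported and ideally checked for robustness.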

Thanks,

Hoang]]>

I am working on coding a relatively simple ordered probit model with a random parameter/random coefficient on one of the explanatory factors. I am using maximum simulated likelihood to approximate the integral. At a basic level, I take draws from a normal density using the inverse cumulative distribution for the standard deviation of the random parameter.

Just to put this into context: I have successfully coded a random-parameter binary probit/logit with n random parameters using the benchmark example in Cameron and Trivedi (2010). However, I am struggling with the ordinal case.

Below is my code:

Code:

clear all
sysuse auto
recode rep78 (2 = 1) (2 = 1) (5 = 1), gen(nx1)
set seed 19011992

* Use this to in turn get draws from the normal distribution.
* Create 50 draws (S=50) from the uniform for
* each observation in the data at hand (N=74).
capture drop draws*
forvalues i = 1/50 {
    gen draws`i' = runiform()
}

capture program drop lfoprobitmsl_bw1
program lfoprobitmsl_bw1
    version 14.2
    args lnf b1 b2 alpha1 ln_sd  // b1 is the constant term, b2 is the slope, alpha1 is the threshold
    // Don't use 'sd' as it will get lost if sd < 0
    tempvar sim_fx sim_avgfx     // Temp holders for the simulated densities
    local y "$ML_y1"
    local sd = exp(`ln_sd')      // Transform back to SD
    qui gen `sim_avgfx' = 0
    set seed 19011992
    forvalues d = 1/50 {
        /* contribution to log-likelihood for outcome y=0 respondents */
        quietly replace `lnf' = ln(normal(-(`b1' + `b2' + `sd'*invnormal(draws`d')*mpg))) if $ML_y1 == 0
        /* contribution to log-likelihood for outcome y=1 respondents */
        quietly replace `lnf' = ln(normal(-(`b1' + `b2' + `sd'*invnormal(draws`d')*mpg) + `alpha1') - normal(-(`b1' + `b2' + `sd'*invnormal(draws`d')*mpg))) if $ML_y1 == 1
        /* contribution to log-likelihood for outcome y=2 respondents */
        quietly replace `lnf' = ln(1 - normal(-(`b1' + `b2' + `sd'*invnormal(draws`d')*mpg) + `alpha1')) if $ML_y1 == 2
        qui gen `sim_fx' = `lnf'
        qui replace `sim_avgfx' = `sim_avgfx' + `sim_fx'/50
        drop `sim_fx'
    }
end

* Let's now calculate the MSL estimator
ml model lf lfoprobitmsl_bw1 (b1: nx = ) (b2: mpg, nocons) (alpha1:) (ln_sd:)
ml init 1 1 1 0, copy
//ml check
ml maximize, difficult

My question is: Am I coding the likelihood part correctly or is the issue really with my poor initial values? Have tried a bunch of initial values though.

Any help will be truly appreciated. Thanks for your time!

Sincerely, Behram

]]>

We conducted a matched case control study.

For each case, 2 control subjects are matched.

These 2 control subjects are from 2 different clusters.

---------------------------------------------------

Data:

group   case      control_clust1   control_clust2
    1   exposed   non-exp          non-exp
    2   exposed   non-exp          non-exp
    3   non-exp   non-exp          non-exp
    4   non-exp   non-exp          non-exp
    5   non-exp   non-exp          non-exp
    6   non-exp   non-exp          non-exp
    7   non-exp   non-exp          non-exp
    8   non-exp   non-exp          non-exp
    9   non-exp   non-exp          non-exp
   10   non-exp   non-exp          non-exp
   11   non-exp   non-exp          non-exp

---------------------------------------------------

First, I compared between case and control_clust1,

to see the effect of exposure, as in:

. mcci 2 9 0 11

                 |        Controls        |
           Cases |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
         Exposed |         2           9  |         11
       Unexposed |         0          11  |         11
-----------------+------------------------+------------
           Total |         2          20  |         22

McNemar's chi2(1) =      9.00        Prob > chi2 = 0.0027
Exact McNemar significance probability      = 0.0039

Proportion with factor
        Cases       .5
     Controls       .0909091      [95% Conf. Interval]
                    ---------     --------------------
   difference       .4090909       .158186     .6599959
        ratio       5.5            1.570118    19.26607
   rel. diff.       .45            .2319678    .6680322
   odds ratio       .              1.973826    .          (exact)

Naturally, the comparison between case and control_clust2 generates the same result.

Q1. Can I describe this result in the manuscript as "OR = 1.97 (P<0.0027) based upon McNemar's chi square"? This seems peculiar, because there is no confidence interval for the OR.

********************************************************

Next, I aggregated the two control clusters and used clogit, as in:

. clogit disease exposure, group(group)

Iteration 0:   log likelihood = -12.084735
Iteration 1:   log likelihood = -9.8875106  (not concave)
Iteration 2:   log likelihood = -9.8875106

Conditional (fixed-effects) logistic regression   Number of obs   =        33
                                                  LR chi2(0)      =      4.39
                                                  Prob > chi2     =         .
Log likelihood = -9.8875106                       Pseudo R2       =    0.1818

------------------------------------------------------------------------------
     disease |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    exposure |   2.39e+20          .        .       .            .           .
------------------------------------------------------------------------------

Q2. This result seems even more peculiar, because mcc showed a highly significant result (P = 0.0027). What went wrong?

Your assistance would be appreciated.

Yosh

]]>

What if I don't want to use d. and gen d.variable to define the first difference? Is there any other way?]]>

Healthcare expenditure data typically have a mass of zeros and are skewed. Commonly, a GLM with a log link and gamma distribution is used to model healthcare expenditure data. How can we measure overfitting using the Copas test in Stata?

Below is the healthcare expenditure dataset sample

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float healthdollar double(grp age)
   0 1 52
   0 1 47
3100 2 51
1200 2 73
4500 2 65
   0 2 67
 200 2 60
   0 1 51
 145 2 59
2040 1 42
end
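A minimal sketch of a Copas-style split-sample check, using the variable names from the dataex example above (and noting that a gamma GLM cannot be fit to the zero observations, so in practice this would be applied to the positive-expenditure part of a two-part model):

```stata
* Fit on a random half of the sample, predict for the held-out half,
* then regress the observed outcome on the out-of-sample prediction.
* A slope well below 1 would suggest overfitting (Copas, 1983).
set seed 12345
gen byte train = (runiform() < 0.5)
glm healthdollar i.grp age if train & healthdollar > 0, family(gamma) link(log)
predict double mu_hat
regress healthdollar mu_hat if !train & healthdollar > 0
```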

Best,

Jack LiangWang]]>