converting regression output (only the intercept) into a new variable/column

Richard Williams

Join Date: Apr 2014

Posts: 5008
#31

23 Jul 2014, 16:40

Clyde and I are both giving you code we can't check, so I hope you are doing spot checks to see if it's right, e.g. if John Smith was supposed to be managing Dodge and Cox between Feb 1972 and January 1980, make sure that he got matched up that way in your records, and that his reign didn't accidentally get extended into the month before or the month after.
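
A spot check of this kind might look like the sketch below. The variable names (fundname, manager, mdate) and the toy records are assumptions for illustration only, not the actual data set.

```stata
* Toy data: one manager-fund spell per row, expanded to monthly records.
* All variable names and values here are hypothetical.
clear
input str20 fundname str20 manager str7 first_m str7 last_m
"DODGE AND COX" "JOHN SMITH" "1972m2" "1980m1"
"DODGE AND COX" "MARY JONES" "1980m2" "1985m12"
end
gen start = monthly(first_m, "YM")
gen stop  = monthly(last_m, "YM")
expand stop - start + 1
bysort fundname manager: gen mdate = start + _n - 1
format mdate %tm

* First and last month recorded for the manager in question
summarize mdate if fundname == "DODGE AND COX" & manager == "JOHN SMITH", format

* The months just outside his tenure should belong to someone else (or nobody)
list fundname manager mdate if fundname == "DODGE AND COX" & inlist(mdate, tm(1972m1), tm(1980m2))
```

If the second -list- shows John Smith in either of those months, the matching code has stretched his tenure.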

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Comment
Richard Williams

Join Date: Apr 2014

Posts: 5008
#32

23 Jul 2014, 16:46

Where did this data set come from? Did you create it yourself? If so, you are certainly free to tidy up your own work.

One way or another, I would clean it up. Take notes on all the changes. Be very careful that you don't introduce new mistakes. Send a note to whoever created the data and offer to give them your cleaned-up version.

Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#33

23 Jul 2014, 17:10

Also, are you sure you don't have errors in the other direction: where two distinct people have the same name?

In general, relying on names as unique identifiers is a very risky proposition. If these are fund managers, do they have any kind of license or credential number that you can use as an identifier instead of their names? Or can you get additional identifying information such as their dates of birth and place or date of graduation from business school? That would be more reliable.

Moving in an entirely different direction, I have begun to dimly perceive what you are trying to do here and I question whether doing 600 regressions is the right way to go about it. As Richard has pointed out, with small numbers of observations for at least some of the fund-manager entities, the standard errors of the alphas will be very large. My bet is that, using your methodology, you will find the vast majority of both the best and the worst performing managers will be among those for whom you have only a small amount of data.

I think you might be better off with a multi-level mixed-model approach. You seem to have managers nested within funds and time series of observations within that. A mixed regression with random intercepts at the fund and manager level, and, if it makes sense in terms of your underlying theory, random slopes at one or both of those levels for your other predictors might be a more sensible approach. You would be able to draw inferences about the distribution of effects of the managers, and differences among the funds themselves. And if you want individual manager-level estimates, the post-estimation -predict, reffects- command will enable you to get them. Of course, this approach relies on some distributional assumptions that may be off base, or may only work if you transform some variables. But if I were pursuing this type of goal, I would be more inclined to invest effort into fitting that kind of model than to do 600 separate regressions, some of which are based on a handful of observations. Just sayin'.
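
To make the suggestion concrete, here is a minimal sketch on simulated data. Every variable name (fundid, mgr, excess, mktrf) is hypothetical, the data are random numbers rather than real fund returns, and -mixed- requires Stata 13 or later. The random-intercepts-only specification is just one of the possibilities described above.

```stata
* Simulated monthly data; every name here is hypothetical.
clear
set seed 12345
set obs 20                               // 20 funds
gen fundid = _n
gen u_fund = rnormal(0, 0.5)             // fund-level random intercept
expand 3                                 // 3 managers per fund
bysort fundid: gen mgr = _n
gen u_mgr = rnormal(0, 0.5)              // manager-level random intercept
expand 36                                // 36 monthly returns per manager
gen mktrf  = rnormal(0, 4)               // stand-in for a market factor
gen excess = u_fund + u_mgr + 1.0*mktrf + rnormal(0, 2)

* Managers nested within funds, random intercepts at both levels
mixed excess mktrf || fundid: || mgr:

* Predicted (shrunken) random intercepts, one per level
predict re_fund re_mgr, reffects
```

A manager-level intercept would then be _b[_cons] + re_fund + re_mgr. Because these predictions are shrunken toward zero when a manager has few observations, short-tenure managers are less likely to dominate both ends of the ranking.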
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#34

23 Jul 2014, 17:34

To set things up for -levenshtein-, you will need to create a data set containing all pairs of names. So first create a data set consisting of all distinct names in your data:

Code:

keep first_name last_name
duplicates drop
save all_names, replace

Then join it with itself to create all pairs of names, and drop pairings of a name with itself:

Code:

use all_names, clear
rename *_name *_name_1
cross using all_names
drop if first_name_1 == first_name & last_name_1 == last_name

Now apply -levenshtein- to first_name/first_name_1 and last_name/last_name_1. Delete any observations that have wildly large levenshtein differences: those won't be potential misidentifications. Then manually examine those with small levenshtein differences: those are where you are most likely to find alternative spellings/nicknames, etc. I don't provide code for this because the machine I'm working on right now does not have -levenshtein- installed and I don't remember the details of how to use it.

I can't comment on your -reclink- command for the same reason, though I have found it to be extremely useful in the past.

In addition to -levenshtein-, the -soundex()- function is often helpful in identifying names that are likely to be alternate spellings of each other. (It was adopted by the US Census Bureau for precisely this purpose and, rather than being a pure string-manipulator, it is based on the phonotactics [Nick Cox will probably correct me here and say the right word is lexotactics] of English names.)

With any of these programs, you still have to review the output and decide which near-matches you consider to be the same person, and which you do not. To that extent, ancillary data such as dates of birth, credential numbers, etc., are very helpful in resolving questionable cases (and also make -reclink- work better).
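
As a rough illustration of the -soundex()- idea, the sketch below groups a handful of made-up surnames by their Soundex code. The variable names and the names themselves are hypothetical.

```stata
* Made-up surnames; last_name, sx and n_same are hypothetical variable names.
clear
input str20 last_name
"WILLIAMS"
"WILLIAMSON"
"WILLAMS"
"SMITH"
"SMYTHE"
"COX"
end

* Built-in function: names that sound alike tend to share a code
gen sx = soundex(last_name)

* Flag codes shared by more than one distinct spelling
bysort sx (last_name): gen n_same = _N
list sx last_name if n_same > 1, sepby(sx)
```

Names that share a code are only candidates for review: WILLIAMS and WILLIAMSON end up with the same code here even though they are presumably different surnames.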

And, finally, after trying all of those things, a manual review of a tabulation of all the names (difficult, but feasible when you have 600 of them) will probably be needed in the end to pick up additional quirky cases that no program found.

And, at the risk of sounding like a broken record, none of these approaches will identify cases where two different managers have the same name. You really need some additional information to resolve that.
Comment
Richard Williams

Join Date: Apr 2014

Posts: 5008
#35

23 Jul 2014, 17:39

Moving in an entirely different direction, I have begun to dimly perceive what you are trying to do here and I question whether doing 600 regressions is the right way to go about it.

Actually, I believe it is several thousand regressions. Lydia said there are thousands of funds, and furthermore within funds separate regressions are run for each manager (in effect, the data might as well be xtset by manager). The maximum possible N would be 600 (if somebody managed the same fund for all 50 years). If somebody managed for 2 years, the N would be 24. In the example Lydia gave in her link, the same fund was analyzed for 10 years, yielding 120 records. I don't know what Lydia is going to do with all these alphas once she has them, but the potentially small and widely varying Ns make me a bit nervous too. I know nothing about the topic, but I would probably feel more comfortable if there were some minimum N (e.g. at least 5 years) or, perhaps, if she did something like the multilevel approach you suggest.

My bet is that, using your methodology, you will find the vast majority of both the best and the worst performing managers will be among those for whom you have only a small amount of data.

I bet you are right. The main conclusion may be that you should never let anybody manage more than 6 months -- either because they will do terribly and you want them out of there, or because the most spectacular performances come from managers who only manage a short time. Such a conclusion may be questionable.

Comment
Richard Williams

Join Date: Apr 2014

Posts: 5008
#36

23 Jul 2014, 17:45

Is there a quick way to see the resemblance between names in 1 column/variable?

You might start a new thread on that, or search the archives. I am pretty sure there have been discussions on identifying near matches, but I've never paid any attention to them. Give examples of near matches like you gave earlier.

EDIT: Once again Clyde seems to have answered the question while I am typing my own wild uneducated guess.

Last edited by Richard Williams; 23 Jul 2014, 17:49.

Comment
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#37

23 Jul 2014, 18:03

So, you recommend making the changes by hand? In edit-mode? That probably leaves codes in Stata, which I can put in my .do-file

I would strongly recommend NOT making the changes by hand in edit mode. Doing that will leave you with code that you can keep in your do-file, but it will be code that looks like this:

Code:

replace name="STEPHEN WILLIAMS" in 67

That's not a particularly helpful way to document since it requires that you know what was in observation 67 prior to your change. If you keep a copy of the original variable it's easier to figure out what you did, but in my opinion that still makes it too difficult to figure out what changes were made.

It may take longer to write, but code of the following form makes it clearer exactly what changes you made:

Code:

replace name="STEPHEN WILLIAMS" if name=="STEVE WILLIAMS"

In either case I would keep a copy of your original name variable before doing this cleaning. Lots of people recommend recoding into new variables, but if you've already written code using the existing variable names I find that it's often easier to start with something like:

Code:

clonevar orgname=name

Then you can go ahead and make your changes to the name variable. That way you have a record of the original variable to refer back to if it's not entirely clear from your code how things changed.
Comment
Richard Williams

Join Date: Apr 2014

Posts: 5008
#38

23 Jul 2014, 20:01

In general, relying on names as unique identifiers is a very risky proposition. If these are fund managers, do they have any kind of license or credential number that you can use as an identifier instead of their names? Or can you get additional identifying information such as their dates of birth and place or date of graduation from business school? That would be more reliable.

Actually, I don't think it is that risky, because duplicate names are only a problem if you have duplicate names at the same fund. It doesn't matter if somebody named Jim Smith heads two different funds; it could even be the same Jim Smith. Put another way, names don't need to be unique identifiers, because you are actually using the combination of name and fundid#.
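
One way to make that combination explicit is sketched below on toy data. The variable names (fundid, name, mdate, mgr_fund) are hypothetical stand-ins.

```stata
* Toy data; fundid, name, mdate and mgr_fund are hypothetical names.
clear
input fundid str20 name mdate
1 "JIM SMITH" 240
1 "JIM SMITH" 241
2 "JIM SMITH" 240
2 "ANN LEE"   241
end
format mdate %tm

* One numeric id per fund-manager pair; the name alone is not the key
egen mgr_fund = group(fundid name)

* Stops with an error if any fund has two records for the same month
isid fundid mdate

list fundid name mdate mgr_fund, sepby(fundid)
```

Here the Jim Smith at fund 1 and the Jim Smith at fund 2 get different values of mgr_fund, so it makes no difference whether they are the same person.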

Like I said before, I might be worried about Jim Smith taking two years off and then taking over his old fund again. Or, maybe Jim Sr. and Jim Jr. heading the fund at different times. Lydia says that doesn't happen. But if it did I might actually make a point of calling one Jim and the other James just to keep them straight.

replace name="STEPHEN WILLIAMS" if name=="STEVE WILLIAMS"

Actually, I don't think that kind of recode is likely. If you've got both Steve and his alter ego Stephen at a fund (as in the earlier Peter and Pedro example), I think you'll drop one of their records and fix the start and end dates of the other. Of course, deleting one record and recoding another is semi-complicated, which makes it all the more important that, one way or another, your changes are well documented. I might make a point of writing lots of comments, under the theory that however I made the changes, they might not make sense to me five years from now just based on the code alone.

Comment
Richard Williams

Join Date: Apr 2014

Posts: 5008
#39

23 Jul 2014, 20:16

What's the advantage of using -xtset-? I first used it, but later I removed it because of the error: repeated values within panel.

Well, one of the advantages is that it warns you when you have repeated values within a panel, e.g. you have two records for 1980. If you did xtset fund year, that would have been wrong because you have multiple records for each year, one for each month. You want a time variable that combines month and year. Clyde may have already given that earlier, but if not I imagine it is not too hard to do.

Another advantage is the ability to easily use lagged values. And to use the various xt commands if you need them.
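
A minimal sketch of both points, assuming separate numeric year and month variables (all names and data here are made up): ym() builds a monthly date, and once the data are xtset the L. operator gives lagged values.

```stata
* Toy panel: 2 funds x 24 months; all variable names are hypothetical.
clear
set seed 1
set obs 48
gen fundid = ceil(_n/24)
bysort fundid: gen year  = 1980 + floor((_n - 1)/12)
bysort fundid: gen month = mod(_n - 1, 12) + 1
gen ret = rnormal()

* Combine year and month into one monthly date variable
gen mdate = ym(year, month)
format mdate %tm

* Any surplus copies reported here would trigger "repeated time values within panel"
duplicates report fundid mdate

xtset fundid mdate
gen ret_lag = L.ret          // previous month's return within the same fund
```

The lag is missing for the first month of each fund, and it would also be missing after any gap in the monthly series.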

But for your purposes, maybe you are fine.

I am too tired to think about the other questions, plus I have my own short report to finish writing! But if you want to test "which kind of managers perform better than others" you probably have another database manipulation ahead of you, e.g. you'll have to use collapse or something like that to get it down to one record per manager. I would definitely include the number of records for each manager, since, as Clyde said, sample size could have a big impact on the value and precision of alpha. If you have the number of records, you could do things like limit the analysis to managers who had headed their funds for at least 5 years.
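
Something along these lines might do it. This is only a sketch on simulated data: mgr_fund, ret, mean_ret and n_obs are hypothetical names, and the mean return stands in for whatever manager-level quantity (such as an estimated alpha) is actually being carried forward.

```stata
* Simulated data: 3 manager-fund spells of 6, 60 and 120 months.
* All variable names are hypothetical.
clear
set seed 2014
set obs 3
gen mgr_fund = _n
gen n_months = 6 if _n == 1
replace n_months = 60  if _n == 2
replace n_months = 120 if _n == 3
expand n_months
drop n_months
gen ret = rnormal(0.5, 2)

* One record per manager, keeping the number of monthly records
collapse (mean) mean_ret = ret (count) n_obs = ret, by(mgr_fund)

* Restrict to managers with at least 5 years (60 months) of data
keep if n_obs >= 60
list
```

Keeping n_obs in the collapsed file also makes it possible to use it as a covariate or weight later instead of a hard cutoff.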

Comment