Help: rogue observations spontaneously created by Stata ruining my Treatment vs Control Group analysis

Emmanuel Pezier

Join Date: Jul 2017

Posts: 8
#1

23 Jul 2017, 11:13

Hi all. Hope someone might be able to help!

I have an Excel data set that loads nicely into Stata. I then generate a number of variables, mostly binary, in order to run ttest and teffects psmatch commands.

When I tab the variables after generating them, they all look correct, with observations grouped into 0 and 1.

The first few commands run smoothly, but after about 5-6 commands I begin to get errors.

It seems the dataset acquires rogue observations that are in neither group 0 nor group 1 of the binary variable generated.

Instead they live in a third group by themselves, listed as something very small (0.00...).

I can manually look through the Data Editor to remove the erroneous observations, but it is time-consuming!

Is there a better fix? What is the root of the problem?

Many thanks,

Emmanuel
Jeph Herrin

Join Date: Apr 2014

Posts: 335
#2

23 Jul 2017, 11:32

Your code is producing the errors, but it is not possible to know why unless you show us the code.
Emmanuel Pezier

Join Date: Jul 2017

Posts: 8
#3

23 Jul 2017, 12:17

Here's an example:
gen advised=strpos(IssuerAdvisor, "") | strpos(VendorAdvisor, "")

Thanks so much!
Emmanuel Pezier

Join Date: Jul 2017

Posts: 8
#4

23 Jul 2017, 12:26

I think the problem may be with my "Proceeds" variable. It's continuous, ranging from 30 to over 1000, but with some (deliberately) blank cells. Maybe the blanks are getting caught up?

gen large=1 if Proceeds>=1000
replace large=0 if Proceeds<1000
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#5

23 Jul 2017, 14:13

You still haven't really given us enough info; however, my guess is that you have a precision issue; see

Code:

help precision
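
For readers unfamiliar with the kind of problem -help precision- describes, here is a minimal sketch. The variable x and the value 0.1 are made up for illustration and are not taken from the poster's data; the point is only that a value stored as a float need not equal the same literal evaluated in double precision.

```stata
* Hypothetical illustration; x is not a variable from this thread.
clear
set obs 3
* Store 0.1 in single precision (float).
generate float x = 0.1
* The literal 0.1 is evaluated in double precision, so this comparison fails.
generate byte match_raw = (x == 0.1)
* Rounding the literal to float precision first makes the comparison succeed.
generate byte match_float = (x == float(0.1))
list
```

If something like this were the cause, values that look like 0 or 1 on screen could fail an exact comparison, which is one reason the storage types of the variables matter.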
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#6

23 Jul 2017, 15:34

Emmanuel Pezier I agree with Rich Goldstein that this is probably a precision issue. But if you want more concrete, specific advice, you will have to post back with some example data where things start off correctly, and the complete code you ran between that point and the point where the "rogue" values started showing up. Without seeing the details, nobody is going to be able to guess what the specific problem is.

Please be sure to read and follow the advice in FAQ #12 so that your re-post will use the -dataex- command to show the example data, and the code is posted between code delimiters. While -dataex- should always be used by everyone to post example data, in your instance it is especially crucial, because only with -dataex- will whoever wants to help you be able to faithfully replicate your example data including data storage types. Without that information, it is unlikely anybody will be able to get to the root of the problem.
Emmanuel Pezier

Join Date: Jul 2017

Posts: 8
#7

23 Jul 2017, 16:59

Thanks so much for your replies. I'm new to Stata, as I'm sure you have noticed (!). I'm unsure how to use -dataex-. Perhaps if I describe the variables, that might help?

IssuerAdvisor   str43    %43s
VendorAdvisor   str26    %26s
BooksName       str157   %157s
Proceeds        int      %14.2f
Books           byte     %14.2f
advised         double   %10.0g
Top8            double   %10.0g
large           double   %10.0g
mid             double   %10.0g
small           double   %10.0g
solebooks       double   %10.0g
perf1d          double   %14.2f

Here's my code, up to the ttest where the error occurs:

gen advised=strpos(IssuerAdvisor, "") | strpos(VendorAdvisor, "")

gen Top8 =strpos(BooksName, "Goldman") | strpos(BooksName, "Morgan Stanley") | strpos(BooksName, "Lynch") | strpos(BooksName, "Citi") | strpos(BooksName, "Suisse") | strpos(BooksName, "Deutsche") | strpos(BooksName, "UBS") | strpos(BooksName, "JPMorgan")

gen solebooks=1 if Books==1

replace solebooks=0 if Books!=1

gen large=1 if Proceeds>=1000

replace large=0 if Proceeds<1000

gen small=1 if Proceeds<=100 & Proceeds!=0

replace small=0 if Proceeds>100

gen mid=1 if Proceeds>100 & Proceeds<1000

replace mid=0 if Proceeds<=100 & Proceeds!=0

replace mid=2 if Proceeds>1000 & Proceeds!=0

ttest perf1d, by(advised)
more than 2 groups found, only 2 allowed
r(420);

I looked at [help precision], but "set type double" doesn't seem to work.

Apologies for my incompetence - any pointers greatly appreciated!

Thanks again,

Emmanuel
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#8

23 Jul 2017, 17:30

Well, I do not see why you are getting the problem you have there. Had you used -dataex- and shown some example data I would have tried to reproduce your problem and then troubleshoot it, but I can't get this to happen with any of my data sets. So I think you're going to have to learn to use -dataex-. It's really, really easy, even for complete beginners. Run -ssc install dataex- to install the -dataex- command. Then run -help dataex- and read the instructions that show up on your screen. I'm quite confident you can do it, even if this is your very first hour using Stata.
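
As a rough sketch of what running -dataex- looks like, here it is applied to the auto dataset that ships with Stata rather than to the poster's data. The choice of variables and the in 1/5 range are arbitrary. In recent Stata releases -dataex- may already be installed, in which case the -ssc install- step can be skipped.

```stata
* Illustration on Stata's bundled auto dataset, not the poster's data.
* Uncomment the next line if -dataex- is not yet installed:
* ssc install dataex
sysuse auto, clear
* Print a copy-and-paste example of three variables for the first 5 observations.
dataex make price mpg in 1/5
```

The output is a block of -input- commands that others can paste into Stata to recreate the example exactly, including storage types.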

That said, your command

Code:

gen advised=strpos(IssuerAdvisor, "") | strpos(VendorAdvisor, "")

though legal, makes no sense. The null string ("") is always found in any string, so both strpos() functions will return 1 (true), and consequently advised will always be 1. There will be no zero values. You can see it for yourself: run -assert advised == 1- and it will pass without complaint. Consequently, the error message you should be getting is:

Code:

1 group found, 2 required
r(420);

So you need to think about what the correct way to define the variable advised is, and code that. If you're still running into this same error message after doing that, post back, using an example with -dataex-.
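
One way to check which values an indicator actually takes, without scanning the Data Editor, is to tabulate it and flag anything that is not exactly 0 or 1. The sketch below uses made-up values, including one tiny stray value of the kind described in the first post; stray is a hypothetical helper variable.

```stata
* Hypothetical data: three clean values and one tiny stray value.
clear
input double advised
0
1
1
1e-8
end
* Show every distinct value, including missing values if any.
tabulate advised, missing
* Flag anything that is not exactly 0 or 1 (missing values are flagged too).
generate byte stray = !inlist(advised, 0, 1)
list if stray
```

Listing the flagged observations shows where the unexpected values sit, which helps trace them back to the command that created them.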

Emmanuel Pezier

Join Date: Jul 2017
Posts: 8
#9

24 Jul 2017, 02:54

Many thanks Clyde. I've tried to run -dataex- as you suggested, with 20 obs. Hope I've done it correctly? I realise it's Monday now, so you are probably tied up, but any further suggestions would be greatly appreciated. Kind regards, Emmanuel

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(PriceDate Proceeds) str43 IssuerAdvisor str26 VendorAdvisor byte Books str157 BooksName double perf1d
19487   80 ""       ""  3 "Credit Suisse
SG Corporate & Investment Banking
Barclays"                                                                                                        -16
20909   82 ""       ""  4 "Citi
Credit Suisse
Mediobanca
UniCredit"                                                                                                                        4.55
19500  367 "Lazard" "" 12 "Goldman Sachs
Deutsche Bank
JPMorgan
Barclays
Credit Suisse
Morgan Stanley
BNP Paribas
UBS
Citi
HSBC
SG Corporate & Investment Banking
Lazard Capital Markets" -3.13
19010  233 ""       ""  3 "Bank of America Merrill Lynch
Mirabaud & Cie
Renaissance Capital"                                                                                              -5.22
18353   51 ""       ""  2 "Canaccord Genuity Corp
Renaissance Capital"                                                                                                                      .71
20278   76 "Lazard" ""  2 "Intesa Sanpaolo SpA
Intermonte Holding SIM SpA"                                                                                                                32.22
19332  659 ""       ""  4 "Barclays
JPMorgan
Morgan Stanley
IPOPEMA"                                                                                                                       6.84
18571 1003 ""       ""  4 "Goldman Sachs
JPMorgan
Morgan Stanley
VTB Capital"                                                                                                             29.96
19761  190 ""       ""  2 "Credit Suisse
Banco Espirito Santo"                                                                                                                            -2.19
18382  208 ""       ""  1 "Sberbank CIB"                                                                                                                                                      1
end
format %tdnn/dd/CCYY PriceDate
label var PriceDate "PriceDate" 
label var Proceeds "Proceeds" 
label var IssuerAdvisor "IssuerAdvisor" 
label var VendorAdvisor "VendorAdvisor" 
label var Books "#Books" 
label var BooksName "BooksName" 
label var perf1d "perf1d"


Nick Cox

Join Date: Mar 2014
Posts: 35724

#10

24 Jul 2017, 04:47

You have some long strings in there which cause data to be wrapped around in the Statalist forum, or at least when I copy and paste. But several small and large puzzles remain.

1. You say 20 observations but I count 10.

2. The numeric variables can be listed easily but nothing in the example suggests a data problem.

Code:

. ds, has(type numeric)
PriceDate  Proceeds   Books      perf1d

. l `r(varlist)'

     +---------------------------------------+
     | PriceDate   Proceeds   Books   perf1d |
     |---------------------------------------|
  1. |  5/9/2013         80       3      -16 |
  2. | 3/31/2017         82       4     4.55 |
  3. | 5/22/2013        367      12    -3.13 |
  4. | 1/18/2012        233       3    -5.22 |
  5. |  4/1/2010         51       2      .71 |
     |---------------------------------------|
  6. |  7/9/2015         76       2    32.22 |
  7. | 12/5/2012        659       4     6.84 |
  8. | 11/5/2010       1003       4    29.96 |
  9. |  2/7/2014        190       2    -2.19 |
 10. | 4/30/2010        208       1        1 |
     +---------------------------------------+

3. Most crucially, you have not addressed Clyde's point that your indicator variable should be identically 1 as empty is found even within non-empty strings:

Code:

. display strpos("frog", "") | strpos("toad", "")
1


William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

24 Jul 2017, 05:49

I note that some of the values of IssuerAdvisor and VendorAdvisor are missing.

Code:

. display strpos("","")
0

So the strpos function syntax appears to be an awkward version of

Code:

. display ("frog"!="")
1

. display (""!="")
0

or

Code:

. display !missing("frog")
1

. display !missing("")
0

I leave it to others to figure out what this all means for Emmanuel's problem; I'm on pre-coffee time at the moment and only accessible to inspirations, not to hard thought.
Emmanuel Pezier

Join Date: Jul 2017

Posts: 8
#12

24 Jul 2017, 07:19

Thanks so much for your replies. Yes, the IssuerAdvisor and VendorAdvisor strings have blanks, precisely where there is no advisor. Hence, I tried to generate the advised vs non-advised groups with my (clumsy) code. Is there a better way? The same issue exists with the Proceeds data. There are missing values, which are meaningful, and which are not the same as zeros. My code is probably clumsy there too? Thanks again to all.
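
On the Proceeds question, one point worth knowing is that Stata treats a numeric missing value as larger than any number, so a condition such as Proceeds>=1000 is true when Proceeds is missing. The sketch below uses made-up values; large_naive is a hypothetical name used only to show the contrast with a guarded version.

```stata
* Hypothetical values; Proceeds is the poster's variable name.
clear
input int Proceeds
80
367
1003
.
end
* Unguarded version: the missing value counts as >= 1000 and is coded 1.
generate byte large_naive = Proceeds >= 1000
* Guarded version: the indicator stays missing when Proceeds is missing.
generate byte large = Proceeds >= 1000 if !missing(Proceeds)
list
```

The same guard would apply to any other size category built from Proceeds with a > or >= comparison.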
Nick Cox

Join Date: Mar 2014

Posts: 35724
#13

24 Jul 2017, 07:24

As William's answer implies, missing() is the key function here.

Code:

gen Advised = missing(IssuerAdvisor, VendorAdvisor)

may be what you seek, but Advised is possibly a wrong name for the situation where either Advisor is missing.
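
To make the naming caveat concrete: missing() with several arguments returns 1 if any of them is missing. The sketch below contrasts that with an indicator for "at least one advisor present", which is only an assumption about what advised is meant to capture. The data are made up, "Acme" is a placeholder name, and any_missing is a hypothetical variable.

```stata
* Hypothetical data; "Acme" is a placeholder advisor name.
clear
input str10 IssuerAdvisor str10 VendorAdvisor
"Lazard" ""
"" ""
"" "Acme"
"Lazard" "Acme"
end
* 1 if either advisor field is empty.
generate byte any_missing = missing(IssuerAdvisor, VendorAdvisor)
* 1 if at least one advisor field is non-empty (assumed meaning of "advised").
generate byte advised = !missing(IssuerAdvisor) | !missing(VendorAdvisor)
list
```

Only the second observation, with both fields empty, gets advised equal to 0, whereas any_missing is 1 for every observation except the last.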
Emmanuel Pezier

Join Date: Jul 2017

Posts: 8
#14

24 Jul 2017, 07:27

Indeed, I can use your code but instead name the variable "Non-advised". Many thanks!
Nick Cox

Join Date: Mar 2014

Posts: 35724
#15

24 Jul 2017, 07:28

You can't use a hyphen in a variable name, but otherwise yes.
