
  • Help: rogue observations spontaneously created by STATA ruining my Treatment vs Control Group analysis

    Hi all. Hope someone might be able to help!

    I have an Excel data set that loads nicely into STATA. I then generate a number of variables, mostly binary, in order to perform ttest and teffects psmatch commands.

    When I tab the variables after generating them, they all look correct, with observations grouped into 0 and 1.

    The first series of commands run smoothly, until I begin to get errors, after about 5-6 commands.

    It seems the dataset acquires rogue observations that are in neither group 0 nor group 1 of the binary variable generated.

    Instead they live in a (3rd) group by themselves, listed as something very small 0.00etc...

    I can manually look through Data Editor to remove the erroneous observation. But it is time consuming!!

    Is there a better fix? What is the root of the problem?

    Many thanks,

    Emmanuel


  • #2
    Your code is producing the errors, but it is not possible to know why unless you show us the code.



    • #3
      here's an example:
      gen advised=strpos(IssuerAdvisor, "") | strpos(VendorAdvisor, "")

      thanks so much!



      • #4
        I think the problem may be with my "Proceeds" variable. It's continuous, ranging from 30 to 1000+, but with some (deliberately) blank cells. Maybe the blanks are getting caught up?

        gen large=1 if Proceeds>=1000
        replace large=0 if Proceeds<1000



        • #5
          you still haven't really given us enough info; however, my guess is that you have a precision issue; see
          Code:
          help precision



          • #6
            Emmanuel Pezier I agree with Rich Goldstein that this is probably a precision issue. But if you want more concrete, specific advice, you will have to post back with some example data where things start off correctly, and the complete code you ran between that point and the point where the "rogue" values started showing up. Without seeing the details, nobody is going to be able to guess what the specific problem is.

            Please be sure to read and follow the advice in FAQ #12 so that your re-post will use the -dataex- command to show the example data, and the code is posted between code delimiters. While -dataex- should always be used by everyone to post example data, in your instance it is especially crucial, because only with -dataex- will whoever wants to help you be able to faithfully replicate your example data including data storage types. Without that information, it is unlikely anybody will be able to get to the root of the problem.



            • #7
              Thanks so much for your replies. I'm new to STATA, as I'm sure you have noticed (!). I'm unsure how to use -dataex-. Perhaps if I describe the variables, that might help?

              IssuerAdvisor  str43   %43s
              VendorAdvisor  str26   %26s
              BooksName      str157  %157s
              Proceeds       int     %14.2f
              Books          byte    %14.2f
              advised        double  %10.0g
              Top8           double  %10.0g
              large          double  %10.0g
              mid            double  %10.0g
              small          double  %10.0g
              solebooks      double  %10.0g
              perf1d         double  %14.2f

              Here's my code, up to the ttest where the error occurs:

              gen advised=strpos(IssuerAdvisor, "") | strpos(VendorAdvisor, "")

              gen Top8 =strpos(BooksName, "Goldman") | strpos(BooksName, "Morgan Stanley") | strpos(BooksName, "Lynch") | strpos(BooksName, "Citi") | strpos(BooksName, "Suisse") | strpos(BooksName, "Deutsche") | strpos(BooksName, "UBS") | strpos(BooksName, "JPMorgan")

              gen solebooks=1 if Books==1

              replace solebooks=0 if Books!=1

              gen large=1 if Proceeds>=1000

              replace large=0 if Proceeds<1000

              gen small=1 if Proceed<=100 & Proceeds!=0

              replace small=0 if Proceeds>100

              gen mid=1 if Proceeds>100 & Proceeds<1000

              replace mid=0 if Proceeds<=100 & Proceeds!=0

              replace mid=2 if Proceeds>1000 & Proceeds!=0

              ttest perf1d, by(advised)
              more than 2 groups found, only 2 allowed
              r(420);


              I looked at [help precision], but "set type double" doesn't seem to work.

              Apologies for my incompetence - any pointers greatly appreciated!

              Thanks again,

              Emmanuel






              • #8
                Well, I do not see why you are getting the problem you have there. Had you used -dataex- and shown some example data I would have tried to reproduce your problem and then troubleshoot it, but I can't get this to happen with any of my data sets. So I think you're going to have to learn to use -dataex-. It's really, really easy, even for complete beginners. Run -ssc install dataex- to install the -dataex- command. Then run -help dataex- and read the instructions that show up on your screen. I'm quite confident you can do it, even if this is your very first hour using Stata.

                That said, your command
                Code:
                gen advised=strpos(IssuerAdvisor, "") | strpos(VendorAdvisor, "")
                though legal, makes no sense. The null string ("") is always found in any string, so both strpos() functions will return 1 (true), and consequently advised will always be 1. There will be no zero values. You can see it for yourself: run -assert advised == 0-. Consequently, the error message you should be getting is:

                Code:
                1 group found, 2 required
                r(420);
                So you need to think about what the correct way to define the variable advised is, and code that. If you're still running into this same error message after doing that, post back, using an example with -dataex-.
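
                Before re-running the test, it may help to see exactly which values the grouping variable holds. The following is a minimal diagnostic sketch, not code from this thread; it assumes only the variable names shown in #7.
                Code:
                * Show every distinct value of advised, including missing values
                tabulate advised, missing
                * Count, then list, the observations that are neither 0 nor 1
                count if !inlist(advised, 0, 1)
                list IssuerAdvisor VendorAdvisor advised if !inlist(advised, 0, 1)
                Whatever the -list- command shows points to the observations where the definition of advised goes wrong.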



                • #9
                  Many thanks Clyde. I've tried to run -dataex- as you suggested, with 20 obs. Hope I've done it correctly? I realise it's Monday now, so you are probably tied up, but any further suggestions would be greatly appreciated. Kind regards, Emmanuel

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input int(PriceDate Proceeds) str43 IssuerAdvisor str26 VendorAdvisor byte Books str157 BooksName double perf1d
                  19487   80 ""       ""  3 "Credit Suisse
                  SG Corporate & Investment Banking
                  Barclays"                                                                                                        -16
                  20909   82 ""       ""  4 "Citi
                  Credit Suisse
                  Mediobanca
                  UniCredit"                                                                                                                        4.55
                  19500  367 "Lazard" "" 12 "Goldman Sachs
                  Deutsche Bank
                  JPMorgan
                  Barclays
                  Credit Suisse
                  Morgan Stanley
                  BNP Paribas
                  UBS
                  Citi
                  HSBC
                  SG Corporate & Investment Banking
                  Lazard Capital Markets" -3.13
                  19010  233 ""       ""  3 "Bank of America Merrill Lynch
                  Mirabaud & Cie
                  Renaissance Capital"                                                                                              -5.22
                  18353   51 ""       ""  2 "Canaccord Genuity Corp
                  Renaissance Capital"                                                                                                                      .71
                  20278   76 "Lazard" ""  2 "Intesa Sanpaolo SpA
                  Intermonte Holding SIM SpA"                                                                                                                32.22
                  19332  659 ""       ""  4 "Barclays
                  JPMorgan
                  Morgan Stanley
                  IPOPEMA"                                                                                                                       6.84
                  18571 1003 ""       ""  4 "Goldman Sachs
                  JPMorgan
                  Morgan Stanley
                  VTB Capital"                                                                                                             29.96
                  19761  190 ""       ""  2 "Credit Suisse
                  Banco Espirito Santo"                                                                                                                            -2.19
                  18382  208 ""       ""  1 "Sberbank CIB"                                                                                                                                                      1
                  end
                  format %tdnn/dd/CCYY PriceDate
                  label var PriceDate "PriceDate" 
                  label var Proceeds "Proceeds" 
                  label var IssuerAdvisor "IssuerAdvisor" 
                  label var VendorAdvisor "VendorAdvisor" 
                  label var Books "#Books" 
                  label var BooksName "BooksName" 
                  label var perf1d "perf1d"



                  • #10
                    You have some long strings in there which cause data to be wrapped around in the Statalist forum, or at least when I copy and paste. But several small and large puzzles remain.

                    1. You say 20 observations but I count 10.

                    2. The numeric variables can be listed easily but nothing in the example suggests a data problem.


                    Code:
                    . ds, has(type numeric)
                    PriceDate  Proceeds   Books      perf1d
                    
                    . l `r(varlist)'
                    
                         +---------------------------------------+
                         | PriceDate   Proceeds   Books   perf1d |
                         |---------------------------------------|
                      1. |  5/9/2013         80       3      -16 |
                      2. | 3/31/2017         82       4     4.55 |
                      3. | 5/22/2013        367      12    -3.13 |
                      4. | 1/18/2012        233       3    -5.22 |
                      5. |  4/1/2010         51       2      .71 |
                         |---------------------------------------|
                      6. |  7/9/2015         76       2    32.22 |
                      7. | 12/5/2012        659       4     6.84 |
                      8. | 11/5/2010       1003       4    29.96 |
                      9. |  2/7/2014        190       2    -2.19 |
                     10. | 4/30/2010        208       1        1 |
                         +---------------------------------------+
                    3. Most crucially, you have not addressed Clyde's point that your indicator variable should be identically 1 as empty is found even within non-empty strings:

                    Code:
                    . display strpos("frog", "") | strpos("toad", "")
                    1



                    • #11
                      I note that some of the values of IssuerAdvisor and VendorAdvisor are missing.
                      Code:
                      . display strpos("","")
                      0
                      So the strpos function syntax appears to be an awkward version of
                      Code:
                      . display ("frog"!="")
                      1
                      
                      . display (""!="")
                      0
                      or
                      Code:
                      . display !missing("frog")
                      1
                      
                      . display !missing("")
                      0
                      I leave it to others to figure out what this all means for Emmanuel's problem; I'm on pre-coffee time at the moment and only accessible to inspirations, not to hard thought.
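
                      Applied to the variables in the -dataex- example, the comparison above could be sketched as follows. The name advised2 is hypothetical, chosen so that it does not clash with the existing advised, and "advised" is assumed to mean that at least one of the two advisor fields is non-empty.
                      Code:
                      * 1 if either advisor field is non-empty, 0 if both are empty
                      generate byte advised2 = (IssuerAdvisor != "") | (VendorAdvisor != "")
                      * Check that only the values 0 and 1 occur
                      tabulate advised2, missing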



                      • #12
                        Thanks so much for your replies. Yes, the IssuerAdvisor and VendorAdvisor strings have blanks, precisely where there is no advisor. Hence, I tried to generate the advised vs non-advised groups with my (clumsy) code. Is there a better way? The same issue exists with the Proceeds data. There are missing values, which are meaningful, and which are not the same as zeros. Is my code clumsy there too? Thanks again to all.



                        • #13
                          As William's answer implies, missing() is the key function here.

                          Code:
                           gen Advised = missing(IssuerAdvisor, VendorAdvisor)
                          may be what you seek, but Advised is possibly a wrong name for the situation where either Advisor is missing.
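
                          The same function bears on the Proceeds question in #12. Stata treats a numeric missing value as larger than any number, so a condition such as Proceeds>=1000 is also true where Proceeds is blank. A minimal sketch that leaves those observations missing instead is given below; the names large2, small2 and mid2 are hypothetical, the cutoffs are copied from #7, and coding mid as a plain 0/1 indicator is an assumption.
                          Code:
                          * Each indicator is 0/1 where Proceeds is known and missing where it is blank
                          generate byte large2 = Proceeds >= 1000 if !missing(Proceeds)
                          generate byte small2 = Proceeds <= 100 if !missing(Proceeds)
                          generate byte mid2 = Proceeds > 100 & Proceeds < 1000 if !missing(Proceeds)
                          * Check that the blanks stayed missing
                          tabulate large2, missing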



                          • #14
                            Indeed, I can use your code but instead name the variable "Non-advised". Many thanks!



                            • #15
                              You can't use a hyphen in a variable name, but otherwise yes.
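
                              One way to keep the hyphenated wording is to put it in the variable label rather than the name. This is a sketch only: NonAdvised is a hypothetical name, and it assumes "non-advised" means that both advisor fields are empty.
                              Code:
                              * 1 if both advisor fields are empty, 0 otherwise
                              generate byte NonAdvised = missing(IssuerAdvisor) & missing(VendorAdvisor)
                              * The label may contain a hyphen even though the name may not
                              label variable NonAdvised "Non-advised"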

