  • encode error; shows too many values

    Hello,
    I wish to destring caseid to create a caseid fixed effect (FE).
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str15 caseid
    " 01000101    02"
    " 01000101    02"
    " 01000101    02"
    " 01000109    01"
    " 01000109    01"
    " 01000110    02"
    " 01000111    02"
    " 01000120    01"
    " 01000120    01"
    " 01000123    02"
    " 01000129    02"
    " 01000129    02"
    " 01000130    02"
    " 01000130    02"
    " 01000130    02"
    " 01000134    02"
    " 01000134    02"
    " 01000143    01"
    " 01000143    01"
    " 01000150    02"
    " 01000150    02"
    " 01000150    02"
    " 01000152    02"
    " 01000152    02"
    " 01000157    02"
    " 01000157    02"
    " 01000162    02"
    " 01000162    02"
    " 01000171    02"
    " 01000171    02"
    " 01000172    02"
    " 01000172    02"
    " 01000196    02"
    " 01000196    02"
    " 01000214    02"
    " 01000214    02"
    " 01000216    02"
    " 01000216    02"
    " 01000216    02"
    " 01000221    02"
    " 01000221    02"
    " 01000221    02"
    " 01000227    04"
    " 01000227    04"
    " 01000245    04"
    " 01000257    02"
    " 01000257    02"
    " 01000258    02"
    " 01000258    02"
    " 01000258    02"
    " 01000274    02"
    " 01000274    02"
    " 01000274    02"
    " 01000275    03"
    " 01000275    03"
    " 01000280    02"
    " 01000280    02"
    " 01000282    02"
    " 01000282    02"
    " 01000283    02"
    " 01000283    02"
    " 01000283    02"
    " 01000287    02"
    " 01000287    02"
    " 01000287    02"
    " 01000290    02"
    " 01000290    02"
    " 01000310    02"
    " 01000310    02"
    " 01000313    02"
    " 01000319    02"
    " 01000366    02"
    " 01000366    02"
    " 01000366    02"
    " 01000366    02"
    " 01000371    04"
    " 01000371    04"
    " 01000373    02"
    " 01000373    02"
    " 01000374    02"
    " 01000374    02"
    " 01000375    03"
    " 01000383    02"
    " 01000383    02"
    " 01000406    02"
    " 01000406    02"
    " 01000406    02"
    " 01000412    05"
    " 01000412    05"
    " 01000429    02"
    " 01000430    02"
    " 01000433    02"
    " 01000433    02"
    " 01000433    02"
    " 01000434    03"
    " 01000434    03"
    " 01000435    02"
    " 01000435    02"
    " 01000435    02"
    " 01000445    02"
    end

    However, the destring command doesn't work. It shows the following:
    Code:
    destring caseid , replace
    caseid contains nonnumeric characters; no replace
    and when I run encode, it shows the following:
    Code:
    . encode caseid , gen(momid)
    too many values
    r(134);
    How can I modify this variable so that I can add it as an FE in regressions?

  • #2
    I think that internal spaces are considered nonnumeric characters. Try something like this.
    Code:
    generate long case_fe = real(subinstr(caseid, " ", "", .))
    assert !mi(case_fe)
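
    As a quick illustration of why the spaces matter (not something you need in your do-file): real() returns missing whenever the whole string is not a valid number, so the internal spaces have to go before conversion.
    Code:
    display real(" 01000101    02")                              // . (missing): internal spaces make the string nonnumeric
    display %12.0f real(subinstr(" 01000101    02", " ", "", .)) // 100010102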

    • #3
      Originally posted by Raphael George View Post
      However, the destring command doesn't work.
      I keep forgetting that destring has an ignore() option. So you can try this as an alternative.
      Code:
      destring caseid, replace ignore(" ")
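
      If you would rather keep caseid as a string, destring's generate() option can be combined with ignore() in the same way; the assert afterwards is just a precaution, and the name case_fe is only an example.
      Code:
      destring caseid, generate(case_fe) ignore(" ")
      assert !missing(case_fe)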

      • #4
        There are missing values for case_fe

        Code:
        dataex caseid case_fe if case_fe==.
        
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str15 caseid long case_fe
        " 22000102    02" .
        " 22000102    02" .
        " 22000102    02" .
        " 22000109    01" .
        " 22000109    01" .
        " 22000126    02" .
        " 22000126    02" .
        " 22000126    02" .
        " 22000126    03" .
        " 22000128    02" .
        " 22000128    02" .
        " 22000128    02" .
        " 22000128    02" .
        " 22000128    02" .
        " 22000131    02" .
        " 22000131    02" .
        " 22000131    02" .
        " 22000131    02" .
        " 22000131    02" .
        " 22000131    02" .
        " 22000140    02" .
        " 22000140    02" .
        " 22000140    02" .
        " 22000156    02" .
        " 22000156    02" .
        " 22000156    02" .
        " 22000156    02" .
        " 22000161    02" .
        " 22000161    02" .
        " 22000163    01" .
        " 22000163    01" .
        " 22000163    01" .
        " 22000170    02" .
        " 22000170    02" .
        " 22000170    02" .
        " 22000173    01" .
        " 22000173    01" .
        " 22000173    01" .
        " 22000173    01" .
        " 22000174    02" .
        " 22000174    02" .
        " 22000174    02" .
        " 22000174    02" .
        " 22000174    02" .
        " 22000174    02" .
        " 22000180    02" .
        " 22000180    02" .
        " 22000180    02" .
        " 22000180    02" .
        " 22000186    02" .
        " 22000186    02" .
        " 22000186    02" .
        " 22000187    04" .
        " 22000202    02" .
        " 22000202    02" .
        " 22000207    02" .
        " 22000207    02" .
        " 22000211    02" .
        " 22000211    02" .
        " 22000215    02" .
        " 22000215    02" .
        " 22000215    02" .
        " 22000215    02" .
        " 22000215    02" .
        " 22000244    02" .
        " 22000244    02" .
        " 22000244    02" .
        " 22000247    02" .
        " 22000247    02" .
        " 22000249    02" .
        " 22000249    02" .
        " 22000255    02" .
        " 22000255    02" .
        " 22000260    02" .
        " 22000260    02" .
        " 22000269    02" .
        " 22000269    02" .
        " 22000269    02" .
        " 22000269    02" .
        " 22000269    02" .
        " 22000272    02" .
        " 22000274    02" .
        " 22000274    02" .
        " 22000279    02" .
        " 22000279    02" .
        " 22000279    02" .
        " 22000287    02" .
        " 22000287    02" .
        " 22000287    02" .
        " 22000287    06" .
        " 22000287    06" .
        " 22000288    02" .
        " 22000288    02" .
        " 22000296    02" .
        " 22000296    02" .
        " 22000296    02" .
        " 22000302    04" .
        " 22000302    04" .
        " 22000305    02" .
        " 22000305    02" .
        end

        • #5
          Thank you for your help; the destring ignore option worked.

          • #6
            Originally posted by Raphael George View Post
            There are missing values for case_fe
            It's because your -dataex- example wasn't representative. -dataex- is better than nothing, I suppose, but it's a poor substitute for posting the dataset, which is why I don't particularly like the insistence on its use.

            • #7
              I also wish to know how to see how many observations there are per case_fe, to check whether there are multiple observations per case_fe before running an FE regression.

              • #8
                Code:
                duplicates report case_fe
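
                Or, if you would rather have the count attached to each observation (the variable name n_per_id is just illustrative; observations with a missing case_fe, if any, get grouped together):
                Code:
                bysort case_fe: generate n_per_id = _N
                tabulate n_per_id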

                • #9
                  Originally posted by Raphael George View Post
                  Thank you for your help; the destring ignore option worked.
                  Actually, the "too many values" error message of encode in #1 above is a troubling sign, and it reminds me that destring is in general not the most foolproof approach to generating numeric IDs from long string-of-numerals IDs.

                  When integer storage types fail because their range is exceeded, as in generate long case_fe = real(subinstr(caseid, " ", "", .)), their failure is obvious, hence the precautionary assert !mi(case_fe) in #2 above.

                  But when floating-point numeric storage types fail for the analogous reason (accuracy is exceeded), their failure is much more insidious and can unwittingly lead to incorrect analyses due to their inability to distinguish between separate entities.
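
                  To make that failure mode concrete (a generic illustration, not tied to this dataset): a float stores integers exactly only up to 2^24 = 16,777,216 and a double only up to 2^53, so distinct IDs beyond those limits can silently collapse into the same stored value.
                  Code:
                  display float(16777216) == float(16777217)   // 1: two distinct integers become identical as floats
                  display 2^52 == 2^52 + 1                     // 0: doubles are still exact here
                  display 2^53 == 2^53 + 1                     // 1: ... but not beyond 2^53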

                  So, for the general task of creating numeric IDs from long strings, I recommend going with an alternative approach—something like the following.
                  Code:
                  preserve
                  
                  contract caseid, freq(dups)
                  // tabulate dups
                  
                  generate long case_fe = _n
                  assert !mi(case_fe)
                  
                  tempfile IDs
                  quietly save `IDs'
                  
                  restore
                  
                  merge m:1 caseid using `IDs', assert(match) nogenerate noreport
                  The same can be accomplished using frames if you're comfortable with those.
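
                  For completeness, here is a minimal sketch of the frames version (Stata 16 or later; the frame name IDs is arbitrary).
                  Code:
                  frame put caseid, into(IDs)
                  frame IDs {
                      contract caseid, freq(dups)
                      // tabulate dups
                      generate long case_fe = _n
                  }
                  frlink m:1 caseid, frame(IDs)
                  frget case_fe, from(IDs)
                  drop IDs    // drop the linkage variable once case_fe has been copied over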

                  Originally posted by Raphael George View Post
                  I also wish to know how to see how many observations there are per case_fe, to check whether there are multiple observations per case_fe before running an FE regression.
                  This approach has the added benefit of answering that question, too, in passing (the commented-out tabulate dups line in the code snippet above).

                  My comment about dataex is admittedly curmudgeonly, but if it's to be the sole official recommendation for the forum then I do believe that it could be improved to better assure representativeness when presenting troublesome datasets, for example, by having it formally select observations at random whenever the dataset is larger than the N that the user opts to show (100 by default).

                  Perhaps its ado-file could be modified to (1) capture the random number generator's state at the beginning of execution, (2) set the seed explicitly (for reproducibility), (3) randomly select the sample from the entire dataset, (4) display the selection as usual, but with the seed also shown in the header (* Example generated by -dataex-. Seed used in selecting representative random sample: 12345. To install: ssc install dataex), and then (5) restore the random number generator's state before exit. Of course, it would be liable to fail to restore the generator state if the user presses the Break key midway (I don't think that preserve does anything with the generator state), but there could be a warning to that effect in its invocation.

                  • #10
                    The comment in #6 about dataex was a small surprise to me, as Joseph Coveney is easily one of the nicest and most diplomatic people here, and it is perhaps the only very slightly sharp comment he has made in several thousand posts on Statalist, extending back to the years when Statalist was email-based.

                    dataex has an unusual status as community-contributed software distributed with Stata. It was Robert Picard's idea and I am very much second author. But any changes I think need to be changes agreeable to the authors and to StataCorp.

                    My own inclination is to leave dataex alone. It is outstandingly important that, while it needs to be versatile enough to give good examples, it also needs to be simple enough to be understandable with minimum effort, especially as beginners to Stata may be asked to use it.

                    The help for dataex already explains how to use it to select a sample randomly. The fact that such selection seems rarely to be done can be interpreted positively, as a sign that dataex often works reasonably well, or negatively, as a sign that random selection is a complication too far, with users either unwilling to read about it or unable to understand it.

                    Sometimes dataex does not work well. People fail to give a big enough sample, or to include variables that are important to a problem, or to include data that show the problem. These problems are inevitable for many users, but they usually have simple solutions that can be explained.

                    The bottom line is that posting .dta files (even less posting Excel-type files) is not usually a good idea, for reasons documented at length.
