  • Variable types and best practice in handling changes of these in dataset

    Dear members, hello again!

    As a new Stata programmer, I have a new question, more about correct procedure and code:
    - how to change a variable from a date (%tm) back to a string, to fit pre-written code
    - or whether I should even do that, or instead go back and change the initial data

    I have code that I wrote for my research (with the help of many Statalist members, last year). This code repeatedly uses interactions between two string variables (yrmonth, rtype).

    For this new research, I prepared the data with the yrmonth variable in %tm format, not as a string. (The data this time around was initially in days, so I extracted yrmonth from a %td date.)

    I see two solutions:
    1) go back and change all the initial data to also include a string field for month and year (where I now have DMY dates),
    2) convert the yrmonth date to a string

    What is the suggested best practice?
    And if it is option 2, please help me with this simple code:

    For changing the date (yrmonth) to string, I tried:

    Code:
     gen str_yrmonth=string(yrmonth)



    Results came out like:

    Code:
    str_yrmonth
    575
    576
    577
    578

    I am considering just leaving the data like this, since I will only be ordering and filtering, for which Stata recognizes the periods.

    Does anyone foresee a problem?


    I am using Stata 12.1 for Mac.

    Thank you for any tips!

    Regards,
    Clarice

  • #2
    Clarice,

    It's not entirely clear what you are trying to do here (not knowing what it was you were trying to do last time) and I can't picture how you could look at interactions between string variables, but I would always prefer to use numeric variables formatted as dates rather than use string variables. Among other reasons, you can do arithmetic on numeric dates (e.g., differences between two dates) and you can sort in chronological order while still being able to see the actual date (not just a number of days/months/milliseconds since some time in the past). Whether that is applicable or feasible in your case is hard to tell.

    As far as your conversion of yrmonth to string is concerned, what you have done is fine as long as you don't plan to do any arithmetic and don't care to know what date is actually being represented. An alternative is to do string(yrmonth,"%tm"), which will show you the date but which will no longer sort in chronological order. Pick your poison...
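
    As a minimal sketch of the two options, assuming yrmonth is a numeric monthly date with a %tm format (the names str_num and str_tm are only illustrative):

    Code:
     * plain conversion: the number of months since 1960m1, e.g. 575
     gen str_num = string(yrmonth)
     * formatted conversion: readable text, e.g. 575 becomes 2007m12
     gen str_tm = string(yrmonth, "%tm")

    With the formatted version, "2008m10" sorts before "2008m2" as a string, which is why the chronological order is lost.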

    Regards,
    Joe


    • #3
      Thanks, Joe!!!

      I get your point. It already helps me a lot.

      Sorry for not being clear. I guess I am always trying to be concise (so as not to take too much of anyone's time) and end up being too succinct.

      The procedure I am studying is called "portfolio sorts", in asset pricing. We sort portfolios of returns by different types of anomalies (key variables); after this sort, we statistically analyze the performance of the top and bottom performers (portfolios ranked higher or lower by the anomaly under discussion).

      I guess that is why the procedure was done with strings.

      Just FYI this is an example:

      Code:
      **divide data into subsets = quintiles
      **code built with collaboration of Statalist members (Nov, 2013)
      
      gen quintile = .
      quietly levelsof yrmonth, local(levs)
      quietly foreach lev of local levs {
             xtile work = return if rtype=="formation" & yrmonth == "`lev'", n(5)
             replace quintile = work if rtype=="formation" & yrmonth == "`lev'"
             drop work
      }

      Since I already have a date variable in my dataset, I am considering leaving the string yrmonth just to satisfy the code.

      Thanks a lot..!!

      And I hope to get better at my future explanations...!!! (That is why I still hesitate to answer questions... I am still a bit raw in my knowledge.)

      Rgs,
      Clarice


      • #4
        Clarice,

        Don't worry about bombarding us with too many details. More details are almost always more useful in helping to solve a problem. Just put the most important information first and then we can ignore the details if they are not needed.

        I don't see anything in your code that requires the use of strings. I also took a quick look through the Statalist archives to see what advice was given last year and I don't see any discussion of whether strings are necessary here. The levelsof command works with both string and numeric variables and, when used with dates, should give the same result whether they are coded as string or numeric. If you were to use a numeric variable instead, you would just need to remember to get rid of the quotation marks in the if qualifier (e.g., yrmonth==`lev' instead of yrmonth=="`lev'").
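
        A sketch of the loop from #3 with that change, assuming yrmonth is numeric (for example a %tm monthly date) and rtype is still a string:

        Code:
        gen quintile = .
        quietly levelsof yrmonth, local(levs)
        quietly foreach lev of local levs {
               * no quotation marks around `lev' because yrmonth is numeric
               xtile work = return if rtype=="formation" & yrmonth == `lev', n(5)
               replace quintile = work if rtype=="formation" & yrmonth == `lev'
               drop work
        }

        The comparison with rtype keeps its quotation marks, since that variable is still a string.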

        Regards,
        Joe


        • #5
          Wow.... Joe... great....!!

          The problem was the quotation marks.

          When I ran the code before with yrmonth (date) and rtype (string), it gave me the error "type mismatch", and I assumed it was because of the different types of variables.

          You are correct: there was no discussion about the strings. I was just helped with the variables I had at the time, which were strings.
          (I just mentioned the use of Statalist, 'cause I wanted to give proper credit to getting at this fancy code, which I didn't do by myself.)

          Thank you so much!

          Well, just learned several things today!!!


          • #6
            I am fond of levelsof but I'd point to this method as more general and less error-prone:

            Code:
             
            gen quintile = .
            egen levs = group(yrmonth) if rtype == "formation"
            su levs, meanonly
            
            quietly forval lev = 1/`r(max)' {
               xtile work = return if levs == `lev', n(5)
               replace quintile = work if levs == `lev'
               drop work
            }

            The key advantages are:
            1. Works with numeric and string variables alike. So you just cycle over groups labelled 1 up.
            2. Works with combinations of two or more variables. They are just extra arguments to group().
            3. Specify any repeated qualifiers just once.
            For more, see http://www.stata.com/support/faqs/da...-with-foreach/
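
            As an illustration of point 2, a sketch that forms quintiles within every combination of yrmonth and rtype, rather than only for rtype == "formation". The names levs2 and quintile2 are only illustrative, and whether this grouping suits a given analysis is an assumption:

            Code:
            gen quintile2 = .
            * one group number per distinct (yrmonth, rtype) pair
            egen levs2 = group(yrmonth rtype)
            su levs2, meanonly

            quietly forval lev = 1/`r(max)' {
               xtile work = return if levs2 == `lev', n(5)
               replace quintile2 = work if levs2 == `lev'
               drop work
            }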


            • #7
              Thanks, Nick...

              I am working on new code now, and will try this version out.

              Thanks for the reference as well.
