  • Measure of similarity between observed values on different variables

    Dear Statalist.

    I am trying to generate a new variable (“wanted”) that captures, for each individual (“id”), how similar the observed characteristic “char” is to the corresponding characteristic “char_r” measured for a reference group.
    Both “char” and “char_r” may take positive or negative values. I plan to use “wanted” in a regression model as a measure of how similar each “id” is to its reference group, so the values of “wanted” need to be meaningful and comparable across different “id”. Here is a small, simplified toy dataset that illustrates the nature of the data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte id double(char char_r)
    1 -.1751724 -.0768999
    2 -.1751724  .0705089
    3  .0213727  .0642464
    4  .0213727 -.0277636
    end
    Any and all suggestions or comments regarding this problem are very welcome.
    Thanks!

  • #2
    I am not sure that I follow this, but it seems that each identifier occurs once, in which case the difference between value and reference value is to me the most obvious measure to use.

    I don't know whether having a different sign is a big deal, or even a little deal, or whether other values mean that some relative measure makes more sense.
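
    A minimal sketch of that difference, assuming the variable names from the example in #1 (the name diff is purely illustrative):

    Code:
    * signed difference between the observed value and the reference value
    gen double diff = char - char_r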


    • #3
      Thank you, Nick, for making me realize that the previous example was probably too simplified. The data are actually panel data with one observation per "id" per period. The reference group value is the same for every "id" in a given period, but changes across periods. I also thought about taking the difference between the two values by "id" and "period", but is this the best or only option? Revised data structure example:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input byte(id period) double(char char_r)
      1 1 -.1751724 -.0768999
      2 1  .0213727 -.0768999
      3 1  .0413727 -.0768999
      1 2  .2751724  .0705089
      2 2  .3451724  .0705089
      3 2  .0032599  .0705089
      4 2   .012287  .0705089
      end
      Thanks again!


      • #4
        OK, so this seems to turn into a twist on standard questions about measuring variability. You might want to work with, say,

        Code:
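        * mean absolute deviation from the reference value, per id
        egen variab1 = mean(abs(char - char_r)), by(id)

        * root mean square deviation from the reference value, per id
        egen variab2 = mean((char - char_r)^2), by(id)
        replace variab2 = sqrt(variab2)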


        • #5
          Thank you so much for your suggestions Nick. I greatly appreciate it. Would it be possible to calculate something along these lines (not the mean) by id and period? So that I get a period specific measure for each id? Thanks again.


          • #6
            Just change what you feed to by().
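
            For instance, a minimal sketch of that change applied to the code in #4 (same variable names assumed):

            Code:
            * same summaries as in #4, now computed within each id-period group
            egen variab1 = mean(abs(char - char_r)), by(id period)

            egen variab2 = mean((char - char_r)^2), by(id period)
            replace variab2 = sqrt(variab2)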


            • #7
              But if you have one observation per identifier per period, aren't we back where we were at #2?


              • #8
                Almost back to #2.
                I am thinking about this tweak of your code, Nick, to get the absolute difference (distance) between the two values for every observation, including those with negative values:

                Code:
                bys id period: gen variab = (char - char_r)^2   
                replace variab = sqrt(variab)
                Unless it is a really bad idea on my part, I might try this one.
                Thanks again.


                • #9
                  The bysort id period: prefix does nothing there, and the square root of a square is just the absolute value, so the result is the same as what you would get directly with

                  Code:
                   
                   gen variab = abs(char - char_r)
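
                  As a quick check, here is a sketch that compares the two routes directly. The name variab_check is hypothetical, and the tolerance is an assumption chosen to allow for variab being stored as a float:

                  Code:
                  * the square-then-root route from #8, without the by prefix
                  gen double variab_check = sqrt((char - char_r)^2)
                  * stops with an error if the two versions differ by more than rounding
                  assert reldif(variab, variab_check) < 1e-6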


                  • #10
                    Thank you Nick!
                    I was not aware of the abs( ) option. Very nice.
                    Best regards.


                    • #11
                      Good, but it's a function and documented as such.

                      Code:
                      help abs()


                      • #12
                        I am trying to estimate fixed effects using country dummy variables, but when I run the regression I get the following results. Can anyone help me with this?
                        Attached Files
