Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Understanding use of rangestat to replicate match() behavior in R or Excel

    Dear Statalist users,

    Some time ago, when I was about to graduate from university, I attended a virtual course on working with household survey data. In that course, the instructor shared some code using the `rangestat` command, but unfortunately, it wasn't explained in detail.

    What confused me is that `rangestat` was used in a way that seemed equivalent to the `match()` function in R, or the `MATCH()` function in Excel. I have read the documentation for `rangestat`, and I can't figure out how it is supposed to replicate that kind of functionality. I'm not sure if I misunderstood something or if there is a specific use case that allows for this behavior.

    For context, here is what each variable represents:

    - `DIRECTORIO`: unique dwelling identifier (each physical housing unit).
    - `SECUENCIA_P`: household number within the dwelling (in some cases, more than one household lives in the same housing unit).
    - `ORDEN`: the person number or ID within the household (e.g., the household head is usually 1).
    - `P6040`: the age of the person.
    - `P1134S1A1`: the `ORDEN` number of the household member that this person helps to dress (i.e., caregiving relationship).

    In other words, for each person in the dataset, `P1134S1A1` refers to another member of the *same household*, and the goal is to retrieve that person’s age (`P6040`) and assign it to a new variable.

    Below, I provide a minimal example of my dataset and the code used, so I can better understand how `rangestat` is functioning in this context.

    ```stata
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long DIRECTORIO byte(SECUENCIA_P ORDEN) int P6040 byte P1134S1A1
    2915602 1 1 23 .
    2915602 1 2 26 4
    2915602 1 3 8 .
    2915602 1 4 1 .
    2915602 1 5 57 .
    2915602 1 6 22 .
    2915603 1 1 44 .
    2915603 1 2 21 .
    2915604 1 1 51 .
    2915604 2 1 20 .
    2915604 2 2 18 .
    2915604 2 3 12 .
    2915605 1 1 24 2
    2915605 1 2 5 .
    2915605 1 3 2 .
    2915606 1 1 42 .
    2915606 1 2 28 .
    2915606 1 3 12 .
    2915607 1 1 34 .
    2915607 1 2 33 .
    2915608 1 1 67 .
    2915608 1 2 61 .
    2915608 1 3 31 .
    2915608 1 4 8 .
    2915609 1 1 65 .
    2915609 1 2 53 .
    2915609 1 3 19 .
    2915610 1 1 79 .
    2915610 1 2 81 .
    2915610 1 3 13 .
    2915611 1 1 38 .
    2915611 1 2 14 .
    2915611 1 3 7 .
    2915611 1 4 75 .
    2915611 1 5 23 .
    2915612 1 1 46 .
    2915612 1 2 43 .
    2915612 1 3 19 .
    2915612 1 4 14 .
    2915614 1 1 36 .
    2915614 1 2 31 .
    2915614 1 3 11 .
    2915614 1 4 3 .
    2915615 1 1 64 .
    2915615 1 2 35 .
    2915615 1 3 30 .
    2915615 1 4 33 5
    2915615 1 5 4 .
    2915616 1 1 31 .
    2915616 1 2 29 .
    2915616 1 3 11 .
    2915618 1 1 42 .
    2915618 1 2 20 .
    2915618 1 3 15 .
    2915618 1 4 20 5
    2915618 1 5 1 .
    2915619 1 1 52 .
    2915619 1 2 34 .
    2915619 1 3 31 .
    2915619 1 4 2 .
    2915620 1 1 54 .
    2915620 1 2 78 .
    2915621 1 1 36 .
    2915621 1 2 35 .
    2915622 1 1 28 4
    2915622 1 2 31 .
    2915622 1 3 5 .
    2915622 1 4 0 .
    2915622 1 5 46 .
    end

    clonevar lookup_P1134S1A1 = P1134S1A1
    replace lookup_P1134S1A1 = 0 if missing(P1134S1A1)

    rangestat vestir1 = P6040, by(DIRECTORIO SECUENCIA_P) int(ORDEN lookup_P1134S1A1 lookup_P1134S1A1)




  • #2
    Disclaimer: I do not use R at all, and I only use Excel to send and receive data from others who do not have any statistical package that I use. So I cannot verify that the -rangestat- code resembles the R and Excel functions you mention. But I will give you a correct description of what it does in your data and how the result would be interpreted.

    The general functioning of the -rangestat- command is this: to each observation in the data set, it associates the set of observations in the entire data set that agree with it on the variables specified in the -by()- option and for which the inequality specified in the -interval()- option holds. It then calculates the statistics requested before the comma and saves them in the data set under the names specified, or under default names if no naming was specified.

    Here is what it does in your data set. P6040 is the variable for age. The -rangestat- command is telling Stata, for each observation in the entire data set, to find the mean value of variable P6040 in the set of observations that agree with the current observation on the values of DIRECTORIO and SECUENCIA_P, and for which the inequality lookup_p1134S1A1 <= ORDEN and ORDEN <= lookup_P1134S1A1. The inequality, having lookup_P1134S1A1 both >= and <= ORDEN reduces, in this case, to an equality: ORDEN == lookup_p1134S1A1. Now, we have to be clear, the first argument of the -interval()- option, in this case, ORDEN, refers to the value of ORDEN in the observation included in the data set being built up for the calculation, whereas the second and third arguments (both lookup_P1134S1A1) refer to the observation for which the set of observations is being created.

    So, to each observation (call it focal observation), Stata will associate all observations in the data set that have the same DIRECTORIO and SECUENCIA_P values as in the focal observation, and in which the value of ORDEN is equal to the focal observation's value of lookup_P1134S1A1. Since lookup_P1134S1A1 was created to be the value of ORDEN corresponding to the dependent taken care of in the household, this means that the data set associated to any observation will be the set of observations that are about the dependent in the same household. And from among those observations, it will calculate the mean age, and save it in a new variable vestir1. So in each observation, vestir is set to the mean age of all the household members that the person designated by ORDEN in the observation helps dress (gives care to).

    Now, I note that in your example data set, observations are uniquely identified by the combination of DIRECTORIO, SECUENCIA_P and ORDEN. It follows that each person only dresses at most one person in the household. If that is just a coincidence in your example, then you need read no farther. But if that is in fact part of the design of your data set, then, although the -rangestat- code works properly, I would do it differently:
    Code:
    frame put DIRECTORIO SECUENCIA_P ORDEN P6040, into(working)
    frlink m:1 DIRECTORIO SECUENCIA_P P1134S1A1, frame(working DIRECTORIO SECUENCIA_P ORDEN)
    frget vestir2 = P6040, from(working)
    This code presumes and verifies that there is at most one person in the household being taken care of by a given person in the household, and if that is violated, the -frlink- command will find any such violations of the assumption and will halt with an error message. Assuming that the data follow the uniqueness assumption, it will find that unique person being dressed in frame working and return that person's age without going through the motions of calculating the mean of a single observation. So, if you have this uniqueness in your design, this code is safer because it will tell you if you have violations of the assumption in your data set and halt calculation, whereas the -rangestat- command will not recognize the problem and just blunder on, averaging the ages of the multiple people it identifies. Of course, if the uniqueness assumption is not true of your data design, then you cannot use the -frame- approach and must use -rangestat-.

    Comment


    • #3
      Many thanks to Clyde Schechter for this detailed and careful explanation.

      My case resembles that of Clyde in that I rarely use Excel or R and my concern in using either is to get out as fast as possible and back to Stata, not so much out of disdain for either, but more because I am much more accustomed to Stata. I have never used the specific functions you name.

      Indeed when you write

      What confused me is that `rangestat` was used in a way that seemed equivalent to the `match()` function in R, or the `MATCH()` function in Excel. I have read the documentation for `rangestat`, and I can't figure out how it is supposed to replicate that kind of functionality
      the implication of an intent to replicate functionality elsewhere is news to me. Whether it turns out to do the same thing, or there are some differences, it's my understanding that rangestat was written originally by Robert Picard to solve a class of problems arising with data in Stata.

      If Robert tells us otherwise, that's fine too.

      Comment

      Working...
      X