  • Converting NHANES drug level data to personal level data

    Hello,
    (First-time poster to the forum. I have reviewed the guidelines in "Advice on Posting" https://www.statalist.org/forums/help#stata but my sincere apologies if I have not gotten something right!)

    I am a student working with a NHANES derived dataset on prescription drug use. Its drug level data needs to be converted to personal level data (i.e. a record for each person) before merging it with NHANES demographic data files by the unique identifier for each individual (variable called seqn).
    "Analysts should convert a drug level data to a personal level data, that is, a record for each person, before merging it with NHANES demographic and other data files by using SEQN."
    Source: https://wwwn.cdc.gov/nchs/nhanes/2009-2010/RXQ_RX_F.htm

    This is because each individual may take more than one medication. If that occurs, each medication is listed on its own line with the seqn. Within my particular dataset, the number of drugs an individual is on ranges from 0 to 20.

    This is an example of how the dataset is currently organized (using medication examples for two individuals, real dataset has 20k)
    seqn rxduse rxddrgid rxdcount
    73557 1 d00262 2
    73557 1 d04113 2
    73557 1 d00262 4
    73558 1 d04538 4
    73558 1 d00746 4
    73558 1 d03182 4

    rxduse refers to whether someone is using a prescription medication or not, rxddrgid is a string variable referring to the specific type of medication (each entry corresponds to a medication in a codebook), and rxdcount is the total number of medications the individual is on.

    I would like to have it this way
    seqn rxduse rxdcount rxddrgid_1 rxddrgid_2 rxddrgid_3 rxddrgid_4
    73557 1 2 d00262 d04113 . .
    73558 1 4 d00262 d04538 d00746 d03182


    I have reviewed posts in the forum, looked at the help system, and visited various sites such as
    https://stats.idre.ucla.edu/stata/se...ta-management/
    https://cph.osu.edu/sites/default/fi...DataMartin.pdf

    Based on reading, I thought this would be an option
    gen num =_n, over (seqn)
    reshape wide rxddrgid, i(seqn) j(num)

    but got the message after the first line
    options not allowed
    r(101);

    Thank you very much in advance. Any advice would be very greatly appreciated! Additionally, I recognize there are likely multiple ways to achieve this data restructuring so please let me know if I should take a different approach entirely.


    Sincerely,
    Kate L. Taylor

  • #2
    Well, you're on the right track, but you got the syntax wrong. The -gen- command does not allow an -over()- option. (The error message, however, is not literally true because it does allow -before()- and -after()- options!) I think what you want is:

    Code:
    by seqn, sort: gen num = _n
    reshape wide rxddrgid, i(seqn) j(num)
    But you are not going to be able to run that either because, if your example is real and represents your data, there is something seriously wrong with the rxdcount. By your definition, as the total number of medicines the person is taking, it should be a) constant over all observations on the same person (seqn), b) equal to the number of observations that person has. But your first seqn, 73557 has two different values of rxdcount, 2 and 4, neither of which is equal to his/her number of observations (3)! When you try to -reshape- the data, Stata will notice the inconsistency and will give you an error message and refuse to proceed. (The value of rxdcount for 73558 is also incorrect as there are only 3 observations but rxdcount is 4. This is a problem, but not one that -reshape- will care about. -reshape- does not know or care what your variables mean, but it does insist that any variables not mentioned in the -reshape wide- command must be consistent within observations having identical values on the -i()- variables.)

    So you need to fix the rxdcount problem before you can proceed. Now, if you don't really need the rxdcount variable for anything going forward, one way to "solve" this problem is to just -drop- it. But that's not an ideal solution: no apartment has just one cockroach. The presence of these erroneous values in your data suggests that something went wrong in the data management process (either at CDC or your own) and you cannot trust that the rest of the data is correct. So I would chase down this error and find a way to correct it, or to at least assure yourself that it is an isolated error that does not affect the validity of the rest of the data set.
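
    One way to start chasing this down is to flag every person whose rxdcount is not constant, or does not match their number of records. This is only a sketch, assuming the long drug-level file is in memory; count_varies and n_records are hypothetical variable names, not part of the NHANES file.

    ```stata
    * Sketch only: assumes the long drug-level data (seqn, rxdcount) is in memory.
    * Flag persons whose rxdcount differs across their own records.
    bysort seqn (rxdcount): gen byte count_varies = rxdcount[1] != rxdcount[_N]
    * Count each person's records so the total can be compared with rxdcount.
    bysort seqn: gen long n_records = _N
    * List every record belonging to a person with either kind of mismatch.
    list seqn rxdcount n_records if count_varies | rxdcount != n_records, sepby(seqn)
    ```

    If people who take no medication still have one record in the file, the second comparison will flag them as well, so those cases would need to be judged separately.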

    Added: I should also point out that, notwithstanding the advice that came with your data, if your patient-level data set has only one observation per patient, you do not have to reduce this data set to that arrangement. You can do:

    Code:
    use person_level_data_set, clear
    merge 1:m seqn using drug_level_dataset
    to put them together. The resulting merged data set will contain as many observations for each seqn as there are in the drug level data set (or 1 for any seqn with nothing corresponding in the drug level data set). Now, it may be that you have other reasons to then go on and -reshape wide- anyway. If that is so, it doesn't matter whether you -reshape- before you -merge- or -merge- before you -reshape-. But let me caution you: most data management and analysis in Stata is easier, often much easier, and sometimes only possible, with the data in long layout. So I encourage you to think ahead to what you will actually be doing with this data. Unless you have a compelling reason to put it into wide layout, your default should be to leave it in long layout. It would be truly senseless to -reshape wide- unnecessarily before -merge-ing, only to -reshape long- after that.
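
    As an illustration of working in long layout after the merge, a person-level indicator can be built without any -reshape-. This is a sketch only: the file names follow the example above, d00262 is just one code taken from the sample listing, and takes_d00262 and one_per_person are hypothetical variable names.

    ```stata
    * Sketch only: file names are the placeholders used in the example above.
    use person_level_data_set, clear
    merge 1:m seqn using drug_level_dataset
    * Check how many records matched before going further.
    tabulate _merge
    * 1 if any of the person's records carries this drug code, 0 otherwise.
    egen byte takes_d00262 = max(rxddrgid == "d00262"), by(seqn)
    * Tag one record per person so person-level tabulations count each person once.
    egen byte one_per_person = tag(seqn)
    tabulate takes_d00262 if one_per_person
    ```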
    Last edited by Clyde Schechter; 13 Oct 2017, 16:18.



    • #3
      The key is to include all the variables that are tied to individual records. For my analyses only the category level indicators and the rxddrgid were important. Therefore, I dropped all the other variables and used the same command as supplied by Clyde. That is
      by seqn, sort: gen num = _n
      However, the reshape command included multiple variable names, in keeping with the syntax requirements:
      reshape wide rxddrgid rxddci1a rxddci1b rxddci1c rxddci2a rxddci2b rxddci2c rxddci3a rxddci3b rxddci3c rxddci4a rxddci4b rxddci4c, i(seqn) j(num)
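
      For anyone who wants to try the approach on a small scale first, here is a self-contained sketch using toy data typed in with -input-. The values are made up to resemble the listing in the original question, with rxdcount set to match each person's number of records (an assumption, so that -reshape- has nothing to object to).

      ```stata
      * Toy data only: rxdcount is made consistent within seqn on purpose.
      clear
      input long seqn byte rxduse str6 rxddrgid byte rxdcount
      73557 1 "d00262" 2
      73557 1 "d04113" 2
      73558 1 "d04538" 3
      73558 1 "d00746" 3
      73558 1 "d03182" 3
      end
      * Number each person's medications 1, 2, 3, ...
      by seqn, sort: gen num = _n
      * One record per person; rxddrgid becomes rxddrgid1, rxddrgid2, rxddrgid3.
      reshape wide rxddrgid, i(seqn) j(num)
      list
      ```

      Renaming rxddrgid to rxddrgid_ before the -reshape- (and reshaping on rxddrgid_) would give the rxddrgid_1, rxddrgid_2 style of names asked for in the original post.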

      Last edited by Sri Mummadi; 25 Feb 2018, 17:46.
