Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshaping data

    Hey everyone,

    I am new to Stata and despite intensive research in this forum I am facing an issue where I need some help.

    I have a dataset with variables like state_1999 state_2001 state_2003 ... state_2019, family_id_1999 ... family_id_2019, College_1999 ... College_2019, you get the idea (total of 170+ variables).
    state_1999 state_2001 ... state_2019 family_id_1999 ... family_id_2019 College...

    Now I want to reshape the dataset so that I have a "year" variable and the variable "state", "family_id" and "College" (+ the others I didn't mention).
    year state family_id College ...

    I guess the way is to reshape long. But think one problem is that I do not have a variable that uniquely specifies the observations, but tbh I am not sure.

    I would be so happy If someone could give me a hint how to tackle this issue.


    Best regards,
    Ronny

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10(state_1999 state_2001 state_2019) str14(family_id_1999 family_id_2019) str12(College_1999 College_2019) str8(xxx_1999 xxx_2019)
    "1" "2" "3" "4" "5" "6" "7" "8" "9"
    end
    
    *EXTRACT STUBNAMES
    qui ds
    local list= ustrregexra("`r(varlist)'", "_(\d{4})", "_")
    local list: list uniq list
    *RESHAPE
    gen obs_no=_n
    reshape long `list', i(obs_no) j(year)
    *CLEAN UP
    rename *_ *
    drop obs_no
    Res.:

    Code:
    . l
    
         +-----------------------------------------+
         | year   state   family~d   College   xxx |
         |-----------------------------------------|
      1. | 1999       1          4         6     8 |
      2. | 2001       2                            |
      3. | 2019       3          5         7     9 |
         +-----------------------------------------+
    Last edited by Andrew Musau; 18 Jan 2023, 01:28.

    Comment


    • #3
      You can add a variable that uniquely identifies observations. While it is a requirement for reshape (so you can tell where each observation in the long dataset originated) it does not have to be meaningful. Your code will be something like the following.
      Code:
      generate seq = _n
      reshape long state family_id College, i(seq) j(year)

      Comment


      • #4
        Welcome to Statalist, Ronny!

        You might want to try something like this:

        Code:
        gen `c(obs_t)' n = _n //  creates a unique identifier for every row
        reshape long state_ family_id_ College_, i(n) j(year)
        rename *_ *
        drop n
        In the future, it would be helpful for you to provide a data extract using the dataex command. For this and other tips on how to post questions effectively, see the Statalist FAQ, esp. #12.

        Added: crossed with #2 and #3. All three solutions are similar, take your pick!
        Last edited by Hemanshu Kumar; 18 Jan 2023, 01:37.

        Comment


        • #5
          Thanks to you both for your quick answers! @Andrew: Your code worked perfectly for my situation. You saved me so much time and nerves, big thanks!

          Comment


          • #6
            Sorry for being late to the party. While the code in #2 worked (as would that in #3), the code in #4 is better. The reason is that -gen unique_identifier = _n- will, by default, create the unique_identifier as a float. If your data set is very large (more than about 16,000,000 observations), the required values at the end of the data set will be too large to fit in float storage and they will be truncated. As a result the identifiers will no longer be unique.

            For a unique identifier, -gen long unique_identifier = _n- is safer, because -long- has enough available* bits to accommodate any possible value up to about 2 billion. It is rare to have a data set with more observations than that (and impossible in BE or SE flavors of Stata). But routinely using -long- can be wasteful of memory.

            So by using -gen `c(obs_t)' unique_identifier = _n-, as recommended in #4, you get the best of both worlds. Stata will identify the smallest storage type that will accommodate the number of observations in the data set and use that for you.

            (*long and float are both 4 bytes, but in float, one byte is lost to hold the exponent, so the largest integer that can be represented precisely is limited to 3 bytes).

            Comment


            • #7
              Here are the limits on storage of decimal integers with full accuracy in the various numeric storage types. The fixed-point variables lose the 27 largest positive values to missing value codes; the similar loss for floating point variables occurs only for the largest exponent, so it doesn't affect the much smaller integer values.
              byte - 7 bits -127 100
              int - 15 bits -32,767 32,740
              long - 31 bits -2,147,483,647 2,147,483,620
              float - 24 bits -16,777,216 16,777,216
              double - 53 bits -9,007,199,254,740,992 9,007,199,254,740,992

              Comment

              Working...
              X