Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshape long/wide problem

    Dear all,

    This is the data.

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str21 SeriesID double(Jan2001 Feb2001 Mar2001)
    "JTS000000000000000HIL" 5722 5303 5528
    "JTS000000000000000HIR"  4.3    4  4.2
    "JTS000000000000000JOL" 5234 5097 4762
    "JTS000000000000000JOR"  3.8  3.7  3.5
    "JTS000000000000000LDL" 2220 1855 2133
    "JTS000000000000000LDR"  1.7  1.4  1.6
    "JTS000000000000000OSL"  377  323  379
    "JTS000000000000000OSR"   .3   .2   .3
    "JTS000000000000000QUL" 3245 3053 3054
    "JTS000000000000000QUR"  2.4  2.3  2.3
    "JTS000000000000000TSL" 5842 5231 5566
    "JTS000000000000000TSR"  4.4  3.9  4.2
    "JTS000000000000000UOR"  1.2  1.2  1.3
    end
    ------------------ copy up to and including the previous line ------------------

    I want to transpose it. So that I have "Year" and all 13 series ID as variables.

    I did:
    Code:
    rename (Jan2001-Aug2023) value=
    reshape long value, string i(SeriesID) j(year)
    And I get:

    Code:
    ----------------------- copy starting from the next line -----------------------
    
    
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str21 SeriesID str7 year double value
    "JTS000000000000000HIL" "Apr2001" 5204
    "JTS000000000000000HIL" "Apr2002" 4928
    "JTS000000000000000HIL" "Apr2003" 4600
    "JTS000000000000000HIL" "Apr2004" 5152
    "JTS000000000000000HIL" "Apr2005" 5362
    "JTS000000000000000HIL" "Apr2006" 5215
    "JTS000000000000000HIL" "Apr2007" 5353
    "JTS000000000000000HIL" "Apr2008" 4948
    "JTS000000000000000HIL" "Apr2009" 3888
    "JTS000000000000000HIL" "Apr2010" 4159
    "JTS000000000000000HIL" "Apr2011" 4296
    end
    ------------------ copy up to and including the previous line ------------------
    However, I want SeriesIDs as variables.
    Like below:

    Year JTS000000000000000HIL "JTS000000000000000HIR JTS000000000000000JO
    . . . .
    .
    . . . .

    I appreciate any help. Thanks.

  • #2
    The basic approach is
    Code:
    rename (Jan2001-Mar2001) _=
    reshape long _, i(SeriesID) j(month_year) string
    reshape wide _, i(month_year) j(SeriesID) string
    rename _* *
    You may need to tinker with your data in some ways to make this workable. In order for this code to run, the following must all be true of SeriesID:
    1. It uniquely identifies observations in your data set
    2. It never contains any characters that are forbidden in variable names (i.e. anything other than _, letters, and digits).
    3. It never exceeds 31 characters in length.
    If any of those three constraints are not met, you will have to modify the contents of SeriesID to make them true.

    Comment


    • #3
      Thanks a lot, Clyde. It worked like a charm!

      Comment


      • #4
        Hi Clyde,

        May I ask a follow up question? How to modify the contents of the SeriesID (or line_description is the variable this time) whose values look like this: "10-Personal income (thousands of dollars)$1000s". So, basically, none of the 3 conditions are met.

        Code:
        . reshape wide _, i(countyfips countyname region year) j(linecode_description_unit)
        > string
        string j variable linecode_description_unit not allowed to contain spaces;
            encode string variable linecode_description_unit into a numeric variable and
            use the new numeric variable as the j variable
        r(111);

        Code:
        . reshape wide _, i(countyfips countyname region year) j(linecode_description_unit) 
        (j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 3
        > 0 31)
        
        Data                               Long   ->   Wide
        -----------------------------------------------------------------------------
        Number of observations        5,158,400   ->   166,400     
        Number of variables                   6   ->   35          
        j variable (31 values)linecode_description_unit->(dropped)
        xij variables:
                                              _   ->   _1 _2 ... _31
        -----------------------------------------------------------------------------
        
        . 
        end of do-file
        
        . do "/var/folders/_v/v4352p5n62v_39tc3lqq86j40000gn/T//SD02413.000000"
        
        . rename _* *
        31 new variable names invalid
            You attempted to rename _1 to 1, _2 to 2, ...  Those are invalid Stata
            variable names.
        r(198);
        When I convert it to numeric, and do reshape, I get this error:


        Thanks.
        Last edited by Lars Pete; 13 Oct 2023, 18:49. Reason: Adding codes that give error

        Comment


        • #5
          In both of these instances you are coming up against the rules that govern Stata variable names. Only letters, digits and the underscore (_) character are allowed, and, in addition, the first character may not be a digit. And the length may not exceed 32 characters.

          In your second approach, following the advice to -encode-, things were fine until you tried to get rid of the initial underscore characters at the end: that would have led to variable names beginning with digits, which is not allowed. If you just omit that last -rename- you will have something that might be workable for you. The main problem is that the variable names will be _1, _2, ... etc. And you will have the cognitive burden of having to always remember what those mean in terms of the original values of SeriesID.

          A potentially better approach is, instead of numericalizing SeriesID, is to edit it so that it will be compatible with becoming variable names. You face two challenges in this regard: excessive length and disallowed characters. There is a Stata function to handle each of these. You can combine them into a single command:
          Code:
          replace SeriesID = substr(strtoname(SeriesID), 1, 31)
          The -strtoname()- function replaces all disallowed characters by underscores. And the -substr()- function truncates that result to its first 31 characters (which leaves room for the _ character that -reshape- will prefix.

          Now, this approach may not be entirely satisfactory. Particularly damaging would be a situation where two different values of SeriesID are the same in their first 31 characters. That would result in the confounding of those two series. And -reshape- will pick this up as an error because you will no longer have unique identification. If you have a situation like this, you will have to use some -replace- statements to edit the offending values of SeriesID into something that retains their similarity to the original values enough that you can recognize it but also preserves their distinctiveness. You will have to use your own judgment how best to do that.

          Comment


          • #6
            Hi Clyde,
            Thanks a lot. I just thought that the best approach would be to rename the vars after reshaping wide since they were too long. So, I just did that.

            Comment

            Working...
            X