
  • Trouble with Transpose - Help

    I've been trying to move away from Excel and manage all my data transformations within Stata, but I can't figure out how to do a few things, and nothing I've found online has worked for me.

    Zillow's time-series datasets come with one column per month, e.g. 1996m4, 1996m5, 1996m6, and so on, so in Stata each of those columns of course becomes its own variable. Naturally, I need to transpose this so that a single variable holds all the time information, with observations one month apart. Neither -xpose- nor -reshape- has come out correctly for me. With -reshape- I got stuck on stubnames: I don't know how to prefix only the selected columns that denote time (say, with X) so I could indicate which variables I needed reshaped. The closest I got was with -xpose-, but it generated dozens of new variable names without correctly identifying the year, or it did not use the first observation as the variable name, etc. What's the best way to go about this transformation? And how do you prefix or suffix a group of variables, but not all of them, for use with -reshape-?

    Zillow Zip Data ex:
    Zip Code, Size Rank, County, City, State, 1996m4, 1996m5, 1996m6... etc

    Thanks a million for any help.

  • #2
    -xpose- is unlikely to be the solution to your problem. It is far more likely to be -reshape-; the question is how to specify the -reshape- command. You don't actually show any example data, so it isn't possible to give particulars. In fact, what you call "Zillow Zip Data ex" isn't even possible as a Stata data set, because variable names cannot begin with numbers.

    Please re-post showing an example of the Stata data. Use the -dataex- command to do that! If you are not familiar with -dataex- yet, read FAQ #12 for information.



    • #3
      Using -dataex- I had to drop ~90% of the variables because of memory and list-length limits, but here's an example. The variables H, I, J, K, and L correspond exactly to what I need combined into one variable, time: H is 1996m4, I is 1996m5, etc.


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long RegionID str31 RegionName str11 City str2 State str30 Metro str11 CountyName byte SizeRank double(H I J K L)
      274772 "Northeast Dallas"                "Dallas"      "TX" "Dallas-Fort Worth"              "Dallas"       1        .        .        .        .        .
      192689 "Paradise"                        "Las Vegas"   "NV" "Las Vegas"                      "Clark"        2   71.741  71.7994  70.7296   69.945  70.3616
      112345 "Maryvale"                        "Phoenix"     "AZ" "Phoenix"                        "Maricopa"     3  46.6023  47.1911  47.1452  47.9351   48.122
      270958 "Upper West Side"                 "New York"    "NY" "New York"                       "New York"     4 267.2269  284.368 282.8276 281.5681 269.4598
       27080 "Sherman Oaks"                    "Los Angeles" "CA" "Los Angeles-Long Beach-Anaheim" "Los Angeles"  5 105.4025 102.1831   99.784 101.0633 103.6897
      118208 "South Los Angeles"               "Los Angeles" "CA" "Los Angeles-Long Beach-Anaheim" "Los Angeles"  6  93.7793  97.3819  96.3294  94.6816   95.331
      270957 "Upper East Side"                 "New York"    "NY" "New York"                       "New York"     7 327.3135 349.4798  361.176   337.92 329.4652
      192820 "Sunrise Manor"                   "Las Vegas"   "NV" "Las Vegas"                      "Clark"        8  69.8048  70.4609  71.1245  69.8857  70.2301
      194430 "East New York"                   "New York"    "NY" "New York"                       "Kings"        9        .        .        .        .        .
      192796 "Spring Valley"                   "Las Vegas"   "NV" "Las Vegas"                      "Clark"       10  75.5574  75.1731  76.0072  76.1621  76.7485
      268496 "Southeast Los Angeles"           "Los Angeles" "CA" "Los Angeles-Long Beach-Anaheim" "Los Angeles" 11  95.4589  91.2375  95.1622  95.5695  99.1671
      112547 "Paradise Valley"                 "Phoenix"     "AZ" "Phoenix"                        "Maricopa"    12  70.7209  71.1481  71.3264  71.4925  72.3798
      111818 "Deer Valley"                     "Phoenix"     "AZ" "Phoenix"                        "Maricopa"    13  63.2806  63.5588  63.7216  63.2889  64.4468
      198687 "Washington Heights"              "New York"    "NY" "New York"                       "New York"    14        .        .        .        .        .
       41466 "Van Nuys"                        "Los Angeles" "CA" "Los Angeles-Long Beach-Anaheim" "Los Angeles" 15        .        .        .        .        .
      272816 "Astoria"                         "New York"    "NY" "New York"                       "Queens"      16        .        .        .        .        .
      267955 "North Mountain"                  "Phoenix"     "AZ" "Phoenix"                        "Maricopa"    17  55.4359  55.7245  55.7523  56.7483  57.7994
      273698 "Far North"                       "Dallas"      "TX" "Dallas-Fort Worth"              "Dallas"      18        .        .        .        .        .
       32059 "Hollywood"                       "Los Angeles" "CA" "Los Angeles-Long Beach-Anaheim" "Los Angeles" 19 120.7376  113.211  103.331        .  81.0699
      199001 "Williamsburg"                    "New York"    "NY" "New York"                       "Kings"       20        .        .        .        .        .
      272902 "Bedford Stuyvesant"              "New York"    "NY" "New York"                       "Kings"       21        .        .        .        .        .
      267908 "Camelback East"                  "Phoenix"     "AZ" "Phoenix"                        "Maricopa"    22  63.0652  61.2435  62.4261  63.1033  62.8148
      272734 "Alhambra"                        "Phoenix"     "AZ" "Phoenix"                        "Maricopa"    23  49.1339  48.6942  48.1377  49.7397  51.0262
      194737 "Flatbush"                        "New York"    "NY" "New York"                       "Kings"       24        .        .        .        .        .
      403222 "Crown Heights"                   "New York"    "NY" "New York"                       "Kings"       25        .        .        .        .        .
      275473 "Southeast Dallas"                "Dallas"      "TX" "Dallas-Fort Worth"              "Dallas"      26        .        .        .        .        .
       25294 "Jamaica"                         "New York"    "NY" "New York"                       "Queens"      27   87.024        .        .        .        .
      195267 "Harlem"                          "New York"    "NY" "New York"                       "New York"    28        .        .        .        .        .
      275237 "Roosevelt"                       "Fresno"      "CA" "Fresno"                         "Fresno"      29  52.8075  54.0029  54.8774  54.3057  53.5087
      343222 "Williamsbridge"                  "New York"    "NY" "New York"                       "Bronx"       30        .        .        .        .        .
      270828 "East Harlem"                     "New York"    "NY" "New York"                       "New York"    31        .        .        .        .        .
      274742 "North Scottsdale"                "Scottsdale"  "AZ" "Phoenix"                        "Maricopa"    32  96.1294  96.1086  96.7989  96.1687  97.8862
      194379 "East Flatbush"                   "New York"    "NY" "New York"                       "Kings"       33        .        .        .        .        .
       35521 "Enterprise"                      "Las Vegas"   "NV" "Las Vegas"                      "Clark"       34   82.651  85.4945  82.2642  78.9182  81.4286
      273565 "East San Jose"                   "San Jose"    "CA" "San Jose"                       "Santa Clara" 35 128.5023 129.5438 131.6951 131.2723 131.9024
      267971 "South Mountain"                  "Phoenix"     "AZ" "Phoenix"                        "Maricopa"    36  48.4479  50.6133  50.9066  51.4026  50.2977
      273555 "East Memphis-Colonial-Yorkshire" "Memphis"     "TN" "Memphis"                        "Shelby"      37  46.4234  47.2015  49.2321  50.0269  50.1466
       10957 "Corona"                          "New York"    "NY" "New York"                       "Queens"      38        .        .        .        .        .
      270945 "Sunset Park"                     "New York"    "NY" "New York"                       "Kings"       39        .        .        .        .        .
       46795 "North Hollywood"                 "Los Angeles" "CA" "Los Angeles-Long Beach-Anaheim" "Los Angeles" 40  97.5472 100.0375 102.4289 102.9208  96.9305
      end

      Much Thanks



      • #4
        OK. That's good. So there are a couple of obstacles that have to be overcome. Because you are importing from Excel, you get variable names H I J K L (and I'm guessing in your full data set there are even more of these). So those are difficult to work with and need to be renamed systematically in a way that reflects a sequence of times. Then we need to -reshape long-. It is possible that some combination of the variables RegionID through SizeRank uniquely identifies observations in the data. But it isn't necessary to worry about that. It's easier to just generate a new variable containing the observation number instead and use that as the -i()- option of reshape.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input long RegionID str31 RegionName str11 City str2 State str30 Metro str11 CountyName byte SizeRank double(H I J K L)
        274772 "Northeast Dallas"                "Dallas"      "TX" "Dallas-Fort Worth"              "Dallas"       1        .        .        .        .        .
        192689 "Paradise"                        "Las Vegas"   "NV" "Las Vegas"                      "Clark"        2   71.741  71.7994  70.7296   69.945  70.3616
        112345 "Maryvale"                        "Phoenix"     "AZ" "Phoenix"                        "Maricopa"     3  46.6023  47.1911  47.1452  47.9351   48.122
        270958 "Upper West Side"                 "New York"    "NY" "New York"                       "New York"     4 267.2269  284.368 282.8276 281.5681 269.4598
         27080 "Sherman Oaks"                    "Los Angeles" "CA" "Los Angeles-Long Beach-Anaheim" "Los Angeles"  5 105.4025 102.1831   99.784 101.0633 103.6897
        118208 "South Los Angeles"               "Los Angeles" "CA" "Los Angeles-Long Beach-Anaheim" "Los Angeles"  6  93.7793  97.3819  96.3294  94.6816   95.331
        270957 "Upper East Side"                 "New York"    "NY" "New York"                       "New York"     7 327.3135 349.4798  361.176   337.92 329.4652
        192820 "Sunrise Manor"                   "Las Vegas"   "NV" "Las Vegas"                      "Clark"        8  69.8048  70.4609  71.1245  69.8857  70.2301
        194430 "East New York"                   "New York"    "NY" "New York"                       "Kings"        9        .        .        .        .        .
        192796 "Spring Valley"                   "Las Vegas"   "NV" "Las Vegas"                      "Clark"       10  75.5574  75.1731  76.0072  76.1621  76.7485
        268496 "Southeast Los Angeles"           "Los Angeles" "CA" "Los Angeles-Long Beach-Anaheim" "Los Angeles" 11  95.4589  91.2375  95.1622  95.5695  99.1671
        112547 "Paradise Valley"                 "Phoenix"     "AZ" "Phoenix"                        "Maricopa"    12  70.7209  71.1481  71.3264  71.4925  72.3798
        111818 "Deer Valley"                     "Phoenix"     "AZ" "Phoenix"                        "Maricopa"    13  63.2806  63.5588  63.7216  63.2889  64.4468
        198687 "Washington Heights"              "New York"    "NY" "New York"                       "New York"    14        .        .        .        .        .
         41466 "Van Nuys"                        "Los Angeles" "CA" "Los Angeles-Long Beach-Anaheim" "Los Angeles" 15        .        .        .        .        .
        272816 "Astoria"                         "New York"    "NY" "New York"                       "Queens"      16        .        .        .        .        .
        267955 "North Mountain"                  "Phoenix"     "AZ" "Phoenix"                        "Maricopa"    17  55.4359  55.7245  55.7523  56.7483  57.7994
        273698 "Far North"                       "Dallas"      "TX" "Dallas-Fort Worth"              "Dallas"      18        .        .        .        .        .
         32059 "Hollywood"                       "Los Angeles" "CA" "Los Angeles-Long Beach-Anaheim" "Los Angeles" 19 120.7376  113.211  103.331        .  81.0699
        199001 "Williamsburg"                    "New York"    "NY" "New York"                       "Kings"       20        .        .        .        .        .
        272902 "Bedford Stuyvesant"              "New York"    "NY" "New York"                       "Kings"       21        .        .        .        .        .
        267908 "Camelback East"                  "Phoenix"     "AZ" "Phoenix"                        "Maricopa"    22  63.0652  61.2435  62.4261  63.1033  62.8148
        272734 "Alhambra"                        "Phoenix"     "AZ" "Phoenix"                        "Maricopa"    23  49.1339  48.6942  48.1377  49.7397  51.0262
        194737 "Flatbush"                        "New York"    "NY" "New York"                       "Kings"       24        .        .        .        .        .
        403222 "Crown Heights"                   "New York"    "NY" "New York"                       "Kings"       25        .        .        .        .        .
        275473 "Southeast Dallas"                "Dallas"      "TX" "Dallas-Fort Worth"              "Dallas"      26        .        .        .        .        .
         25294 "Jamaica"                         "New York"    "NY" "New York"                       "Queens"      27   87.024        .        .        .        .
        195267 "Harlem"                          "New York"    "NY" "New York"                       "New York"    28        .        .        .        .        .
        275237 "Roosevelt"                       "Fresno"      "CA" "Fresno"                         "Fresno"      29  52.8075  54.0029  54.8774  54.3057  53.5087
        343222 "Williamsbridge"                  "New York"    "NY" "New York"                       "Bronx"       30        .        .        .        .        .
        270828 "East Harlem"                     "New York"    "NY" "New York"                       "New York"    31        .        .        .        .        .
        274742 "North Scottsdale"                "Scottsdale"  "AZ" "Phoenix"                        "Maricopa"    32  96.1294  96.1086  96.7989  96.1687  97.8862
        194379 "East Flatbush"                   "New York"    "NY" "New York"                       "Kings"       33        .        .        .        .        .
         35521 "Enterprise"                      "Las Vegas"   "NV" "Las Vegas"                      "Clark"       34   82.651  85.4945  82.2642  78.9182  81.4286
        273565 "East San Jose"                   "San Jose"    "CA" "San Jose"                       "Santa Clara" 35 128.5023 129.5438 131.6951 131.2723 131.9024
        267971 "South Mountain"                  "Phoenix"     "AZ" "Phoenix"                        "Maricopa"    36  48.4479  50.6133  50.9066  51.4026  50.2977
        273555 "East Memphis-Colonial-Yorkshire" "Memphis"     "TN" "Memphis"                        "Shelby"      37  46.4234  47.2015  49.2321  50.0269  50.1466
         10957 "Corona"                          "New York"    "NY" "New York"                       "Queens"      38        .        .        .        .        .
        270945 "Sunset Park"                     "New York"    "NY" "New York"                       "Kings"       39        .        .        .        .        .
         46795 "North Hollywood"                 "Los Angeles" "CA" "Los Angeles-Long Beach-Anaheim" "Los Angeles" 40  97.5472 100.0375 102.4289 102.9208  96.9305
        end
        
        rename (H-L) X=
        rename X* X#, renumber
        
        gen long obs_no = _n
        reshape long X, i(obs_no) j(month)
        replace month = month + tm(1996m4) - 1
        format month %tm
        So the two -rename- commands replace H through L by X1 through X5. If there are more variables than that, just replace H-L by the actual range of variable names and Stata will renumber accordingly.

        After the reshape command the month variable looks like 1, 2, 3, 4,... The final two lines serve to convert those sequential numbers into a sequence starting with the Stata internal format numerical representation of 1996m4, 1996m5, 1996m6, ..., and then apply a monthly date format so that the human eye can see what the numbers represent.

        If you are going to work with this as a panel data set, -xtset obs_no month- will do the trick.



        • #5
          Ahhh thank you so very much. That absolutely solved my problem as well as gave me a perfect example use of the reshape command.

          Now, just so I'm clear on a few basic things: you mentioned that the varnames H-L were created that way because I imported from an Excel file. How does importing through a CSV change the scenario? When I import through CSV, instead of H-L I get v8, v9, etc. Is the difference you referred to the fact that these already share the "v" prefix, which can be used as a stubname?
          [Attached image: CSV.jpg]



          • #6
            Yes, if you used -import delimited- on a .csv you would get varnames v8, v9, ... and then you would not need the initial rename command. Of course, then the post-reshape command would change to:
            Code:
            replace month = month + tm(1996m4) - 8
            and the resulting variable would be named v instead of X.
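            Putting this together with the code from #4, the full CSV workflow might look like the sketch below. The file name and the assumption that v8 is the first month column are hypothetical; adjust to your own file.
            Code:
            * Sketch only: "zillow.csv" and the v8 starting column are assumptions.
            import delimited using "zillow.csv", clear

            * The v-prefixed variables are already numbered, so no rename is needed.
            gen long obs_no = _n
            reshape long v, i(obs_no) j(month)

            * v8 corresponds to 1996m4, so subtract 8 rather than 1.
            replace month = month + tm(1996m4) - 8
            format month %tm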



            • #7
              Originally posted by Clyde Schechter View Post
              [...] It is possible that some combination of the variables RegionID through SizeRank uniquely identifies observations in the data. But it isn't necessary to worry about that. [...]
              One thing I wanted to bring up: you mentioned that "It is possible that some combination of the variables RegionID through SizeRank uniquely identifies observations in the data. But it isn't necessary to worry about that." RegionID corresponds to the zip code for the row of data it belongs to, so the zip is definitely unique to the observation. Would there be another way of going about it?



              • #8
                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str9 v1 str15 v2 str8 v3
                "Day Index" "Segment"         "Sessions"
                "8/8/15"    "All Users"       "57"      
                "8/8/15"    "New Users"       "50"      
                "8/8/15"    "Returning Users" "7"       
                "8/8/15"    "Paid Traffic"    "0"       
                "8/9/15"    "All Users"       "90"      
                "8/9/15"    "New Users"       "80"      
                "8/9/15"    "Returning Users" "10"      
                "8/9/15"    "Paid Traffic"    "0"       
                "8/10/15"   "All Users"       "96"      
                "8/10/15"   "New Users"       "70"      
                "8/10/15"   "Returning Users" "26"      
                "8/10/15"   "Paid Traffic"    "3"       
                "8/11/15"   "All Users"       "161"     
                "8/11/15"   "New Users"       "131"     
                "8/11/15"   "Returning Users" "30"      
                "8/11/15"   "Paid Traffic"    "2"       
                "8/12/15"   "All Users"       "113"     
                "8/12/15"   "New Users"       "79"      
                "8/12/15"   "Returning Users" "34"      
                "8/12/15"   "Paid Traffic"    "3"       
                "8/13/15"   "All Users"       "100"     
                "8/13/15"   "New Users"       "75"      
                "8/13/15"   "Returning Users" "25"      
                "8/13/15"   "Paid Traffic"    "2"       
                "8/14/15"   "All Users"       "84"      
                "8/14/15"   "New Users"       "62"      
                "8/14/15"   "Returning Users" "22"      
                "8/14/15"   "Paid Traffic"    "3"       
                "8/15/15"   "All Users"       "59"      
                "8/15/15"   "New Users"       "43"      
                "8/15/15"   "Returning Users" "16"      
                "8/15/15"   "Paid Traffic"    "0"       
                "8/16/15"   "All Users"       "71"      
                "8/16/15"   "New Users"       "52"      
                "8/16/15"   "Returning Users" "19"      
                "8/16/15"   "Paid Traffic"    "0"       
                "8/17/15"   "All Users"       "96"      
                "8/17/15"   "New Users"       "77"      
                "8/17/15"   "Returning Users" "19"      
                "8/17/15"   "Paid Traffic"    "3"       
                "8/18/15"   "All Users"       "124"     
                "8/18/15"   "New Users"       "98"      
                "8/18/15"   "Returning Users" "26"      
                "8/18/15"   "Paid Traffic"    "7"       
                "8/19/15"   "All Users"       "109"     
                "8/19/15"   "New Users"       "72"      
                "8/19/15"   "Returning Users" "37"      
                "8/19/15"   "Paid Traffic"    "3"       
                "8/20/15"   "All Users"       "131"     
                "8/20/15"   "New Users"       "97"      
                "8/20/15"   "Returning Users" "34"      
                "8/20/15"   "Paid Traffic"    "9"       
                "8/21/15"   "All Users"       "253"     
                "8/21/15"   "New Users"       "181"     
                "8/21/15"   "Returning Users" "72"      
                "8/21/15"   "Paid Traffic"    "8"       
                "8/22/15"   "All Users"       "74"      
                "8/22/15"   "New Users"       "53"      
                "8/22/15"   "Returning Users" "21"      
                "8/22/15"   "Paid Traffic"    "0"       
                "8/23/15"   "All Users"       "117"     
                "8/23/15"   "New Users"       "89"      
                "8/23/15"   "Returning Users" "28"      
                "8/23/15"   "Paid Traffic"    "0"       
                "8/24/15"   "All Users"       "158"     
                "8/24/15"   "New Users"       "120"     
                "8/24/15"   "Returning Users" "38"      
                "8/24/15"   "Paid Traffic"    "4"       
                "8/25/15"   "All Users"       "137"     
                "8/25/15"   "New Users"       "111"     
                "8/25/15"   "Returning Users" "26"      
                "8/25/15"   "Paid Traffic"    "8"       
                "8/26/15"   "All Users"       "141"     
                "8/26/15"   "New Users"       "103"     
                "8/26/15"   "Returning Users" "38"      
                "8/26/15"   "Paid Traffic"    "0"       
                "8/27/15"   "All Users"       "154"     
                "8/27/15"   "New Users"       "104"     
                "8/27/15"   "Returning Users" "50"      
                "8/27/15"   "Paid Traffic"    "4"       
                "8/28/15"   "All Users"       "113"     
                "8/28/15"   "New Users"       "80"      
                "8/28/15"   "Returning Users" "33"      
                "8/28/15"   "Paid Traffic"    "2"       
                "8/29/15"   "All Users"       "94"      
                "8/29/15"   "New Users"       "72"      
                "8/29/15"   "Returning Users" "22"      
                "8/29/15"   "Paid Traffic"    "0"       
                "8/30/15"   "All Users"       "110"     
                "8/30/15"   "New Users"       "86"      
                "8/30/15"   "Returning Users" "24"      
                "8/30/15"   "Paid Traffic"    "0"       
                "8/31/15"   "All Users"       "129"     
                "8/31/15"   "New Users"       "113"     
                "8/31/15"   "Returning Users" "16"      
                "8/31/15"   "Paid Traffic"    "2"       
                "9/1/15"    "All Users"       "159"     
                "9/1/15"    "New Users"       "133"     
                "9/1/15"    "Returning Users" "26"      
                end

                I also thought I'd ask (again, this is all in an effort to stop using Excel for data management, so I apologize for any redundancy): with the data above, it looks like it would be a -reshape wide- command? I want Day Index to be the day, of course, with -tsset- set to daily, and the categorization of whether the traffic was All Users, Paid Traffic, etc. to be indicated simply by a number.



                • #9
                  Re #7: if the zip code uniquely identifies observations (that is, no two observations share the same zip code), then you could skip generating the obs_no variable and instead use zip_code in the -i()- option of the -reshape- command. The end result will be the same, except there will be no obs_no variable.
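                  As a sketch (the variable name zip_code is an assumption, and this presumes the -rename- commands from #4 have already been run):
                  Code:
                  * Sketch only: zip_code is a hypothetical variable name.
                  isid zip_code                         // verify uniqueness first
                  reshape long X, i(zip_code) j(month)
                  replace month = month + tm(1996m4) - 1
                  format month %tm
                  * If zip_code is a string, -destring- or -encode- it before:
                  xtset zip_code month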

                  Re #8: first, whoever created this data set did something weird, because instead of having normal variable names, what appear to be the variable names you would want are stored in the first observation of the data set. That first observation is useless for any data management or analysis purpose. The dates are not usable either, because they are strings, and the same goes for the Sessions variable. So the first thing I would do to fix up this data set is:

                  Code:
                  // FIX VARIABLE NAMES
                  forvalues i = 1/3 {
                      rename v`i' `=strtoname(v`i'[1])'
                  }
                  drop in 1
                  
                  // CREATE STATA DATE VARIABLE
                  gen date = daily(Day_Index, "MD20Y")
                  assert missing(date) == missing(Day_Index)
                  drop Day_Index
                  format date %td
                  
                  // CONVERT SESSIONS TO NUMERIC
                  destring Sessions, replace

                  (If this data set was imported from a text file using -import delimited-, rather than fixing it, I'd go back and do it over, paying attention to the options offered by -import delimited-, such as -varnames()-.)
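
                  For reference, a minimal sketch of what a cleaner re-import might look like (the filename traffic.csv is hypothetical):

                  Code:
                  import delimited "traffic.csv", varnames(1) clear

                  With -varnames(1)-, the first row of the file supplies the variable names, so none of the renaming above would be needed.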

                  Now you have a workable data set. Whether you would want to -reshape- this to wide depends on what you want to do with it. Most things in Stata are easier to do in long layout. On the other hand, I do notice that some of the observations have Segment == "All Users", and in those observations, Sessions appears to be the sum of the values of Sessions in the other observations for the same date. Observations like that will cause a great deal of confusion and trouble when working in Stata and are probably best dropped: -drop if Segment == "All Users"-. But if you're going to do something that calls for wide layout, then the extra variable Sessions_All_Users that would be created probably won't be in the way.
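
                  If you stay in long layout and want the numeric segment indicator you mentioned, here is a minimal sketch (assuming the fixed-up variables date, Segment, and Sessions from the code above; segment_code is just a name I made up):

                  Code:
                  encode Segment, gen(segment_code)   // numeric codes, with value labels attached
                  drop if Segment == "All Users"      // optional: drop the redundant totals
                  xtset segment_code date             // a daily panel: one series per segment

                  Note that with several segments per date you would declare this with -xtset- rather than -tsset-, since the same date appears in more than one observation.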

                  Without knowing where you want to go, I can't really give you a map or directions.



                  • #10
                    In regard to #8, the data came out of the Google Analytics platform for a medium-sized law firm. Sessions holds their traffic data, so "All Users" is all of them, "New Users" is the new visitors, and so on. I would definitely want each observation down the data browser to be a new day, so that I can run a time series and all the typical statistics. I could have a number designating Segment categorically and simply have it all listed under one variable. This is how I imagined doing it, perhaps with value labels? As for the first part and fixing the variable names, I'll go ahead and give that a try, since the file came straight out of Google. I might also quickly check whether I can export to a more practical pivot table.



                    • #11
                      This is ideally what I'd like done in Stata. All of the units refer to "sessions," so repeating that is redundant. And since every date has a value for all four categories (even if it is 0), I can record all four counts on a single row rather than repeating the same date four times, once for each category.
                      [Attached image: 2017-08-15_20-58-55.jpg — a spreadsheet screenshot of the desired wide layout]



                      • #12
                        So, to get a layout similar to the spreadsheet you show, after doing the fixing-up described in #9, you could do:

                        Code:
                        replace Segment = strtoname(Segment)
                        rename Sessions Sessions_
                        reshape wide Sessions_, i(date) j(Segment) string
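
                        After that -reshape-, each date appears exactly once, so the daily time-series declaration you asked about is straightforward (a sketch, using the date variable created in #9):

                        Code:
                        tsset date, daily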



                        • #13
                          So I finished most of the analysis, but I wanted to clarify a few things about the data restructuring in Stata. Using the command in #12, I wasn't able to get it to work because of this code from #9:

                          Code:
                          // FIX VARIABLE NAMES
                          forvalues i = 1/3 {
                              rename v`i' `=strtoname(v`i'[1])'
                          }
                          drop in 1

                          // CREATE STATA DATE VARIABLE
                          gen date = daily(Day_Index, "MD20Y")
                          assert missing(date) == missing(Day_Index)
                          drop Day_Index
                          format date %td
                          It returned the error "v1 not found". I copied the data over from Stata because I believe it actually has to do with the date variable being imported differently via Excel versus CSV. So I wasn't able to get to generating the Stata date variable the way you mentioned, because I was likewise getting an error with the -gen date- command. Either way I import it, though, I want to understand the syntax. Is there a fix for the data below, given that the date is formatted the way it is but has duplicates? It's essentially the same question as #11, but with the new -dataex- output:

                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input int Day_Index str15 Segment int Sessions
                          20625 "All Users"       17
                          20625 "Paid Traffic"     0
                          20625 "Returning Users"  1
                          20625 "New Users"       16
                          20626 "All Users"       25
                          20626 "Paid Traffic"     0
                          20626 "Returning Users"  1
                          20626 "New Users"       24
                          20627 "All Users"       18
                          20627 "Paid Traffic"     0
                          20627 "Returning Users"  1
                          20627 "New Users"       17
                          20628 "All Users"       29
                          20628 "Paid Traffic"     0
                          20628 "Returning Users"  9
                          20628 "New Users"       20
                          20629 "All Users"       30
                          20629 "Paid Traffic"     0
                          20629 "Returning Users"  8
                          20629 "New Users"       22
                          20630 "All Users"       24
                          20630 "Paid Traffic"     0
                          20630 "Returning Users"  4
                          20630 "New Users"       20
                          20631 "All Users"       26
                          20631 "Paid Traffic"     0
                          20631 "Returning Users"  3
                          20631 "New Users"       23
                          20632 "All Users"       28
                          20632 "Paid Traffic"     0
                          20632 "Returning Users"  6
                          20632 "New Users"       22
                          20633 "All Users"       20
                          20633 "Paid Traffic"     0
                          20633 "Returning Users"  5
                          20633 "New Users"       15
                          20634 "All Users"       21
                          20634 "Paid Traffic"     0
                          20634 "Returning Users"  2
                          20634 "New Users"       19
                          20635 "All Users"       14
                          20635 "Paid Traffic"     0
                          20635 "Returning Users"  4
                          20635 "New Users"       10
                          20636 "All Users"       17
                          20636 "Paid Traffic"     0
                          20636 "Returning Users"  2
                          20636 "New Users"       15
                          20637 "All Users"       15
                          20637 "Paid Traffic"     0
                          20637 "Returning Users"  1
                          20637 "New Users"       14
                          20638 "All Users"       16
                          20638 "Paid Traffic"     0
                          20638 "Returning Users"  2
                          20638 "New Users"       14
                          20639 "All Users"       43
                          20639 "Paid Traffic"     0
                          20639 "Returning Users"  5
                          20639 "New Users"       38
                          20640 "All Users"       49
                          20640 "Paid Traffic"     0
                          20640 "Returning Users"  2
                          20640 "New Users"       47
                          20641 "All Users"       19
                          20641 "Paid Traffic"     0
                          20641 "Returning Users"  2
                          20641 "New Users"       17
                          20642 "All Users"       26
                          20642 "Paid Traffic"     0
                          20642 "Returning Users"  1
                          20642 "New Users"       25
                          20643 "All Users"       18
                          20643 "Paid Traffic"     0
                          20643 "Returning Users"  5
                          20643 "New Users"       13
                          20644 "All Users"       19
                          20644 "Paid Traffic"     0
                          20644 "Returning Users"  2
                          20644 "New Users"       17
                          20645 "All Users"       17
                          20645 "Paid Traffic"     0
                          20645 "Returning Users"  2
                          20645 "New Users"       15
                          20646 "All Users"       16
                          20646 "Paid Traffic"     0
                          20646 "Returning Users"  1
                          20646 "New Users"       15
                          20647 "All Users"       21
                          20647 "Paid Traffic"     0
                          20647 "Returning Users"  7
                          20647 "New Users"       14
                          20648 "All Users"       49
                          20648 "Paid Traffic"     0
                          20648 "Returning Users"  2
                          20648 "New Users"       47
                          20649 "All Users"       39
                          20649 "Paid Traffic"     0
                          20649 "Returning Users"  4
                          20649 "New Users"       35
                          end
                          format %tdnn/dd/CCYY Day_Index

                          Thanks a million.
                          -AS
                          Last edited by Ali Sheikhpour; 17 Aug 2017, 16:08.



                          • #14
                            Yes, those fixes in #9 were based on the earlier example data shown in #8. In that data set, the variables came in as v1, v2, and v3, with the "real" variable names in the first observation. Moreover, variable v1 came in as a string, not a date. That is what I would expect when you use -import delimited- on a csv file without specifying the -varnames(1)- option. The code in #9 was designed to fix that up and make the data suitable for reshaping.

                            The data example you show in #13 is different. Its variable names are Day_Index, Segment, and Sessions, and the variable Day_Index is already a correct Stata internal-format date variable. So #13's data is what you would get after applying the code in #9 to the data in #8 (except that the date variable would be named date rather than Day_Index, but it would be otherwise the same). So there is no need to apply the code in #9 to the data in #13. In fact, you can't, as you found out: the code in #9 breaks immediately, because the first thing it does is look for a variable named v1, and there isn't one. Similarly, you tried to convert Day_Index to a date variable, but you can't do that either, because it already is a date variable and the code is looking for a string variable.
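
                            Incidentally, if you want fix-up code that is safe to run regardless of which version of the data you have loaded, one sketch is to test for the raw variable names first:

                            Code:
                            capture confirm variable v1
                            if _rc == 0 {
                                // raw import: variable names are still in the first observation
                                forvalues i = 1/3 {
                                    rename v`i' `=strtoname(v`i'[1])'
                                }
                                drop in 1
                            }

                            If v1 does not exist, -capture- swallows the error, _rc is nonzero, and the block is skipped.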

                            With the data shown in #13, the code to reshape it into wide layout is essentially the same as shown in #12, except that date is replaced by Day_Index:

                            Code:
                            replace Segment = strtoname(Segment)
                            rename Sessions Sessions_
                            reshape wide Sessions_, i(Day_Index) j(Segment) string
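
                            A quick check after the reshape (a sketch; -isid- will complain if any value of Day_Index still appears more than once):

                            Code:
                            isid Day_Index
                            describe Sessions_*
                            tsset Day_Index, daily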

