Re-organizing a dataset for a better study

Aziz Essouaied

Join Date: Apr 2020
Posts: 203

10 Nov 2023, 13:40

Hello Stata people,

I have this dataset that I'm working on:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str13 governorate int year float consecutivedrydays
"Pog_Ariana" 1990 55.416
"Pog_Ariana" 1991 55.611
"Pog_Ariana" 1992 55.498
"Pog_Ariana" 1993 55.121
"Pog_Ariana" 1994 55.315
"Pog_Ariana" 1995 55.026
"Pog_Ariana" 1996  55.28
"Pog_Ariana" 1997 55.287
"Pog_Ariana" 1998 54.718
"Pog_Ariana" 1999 55.035
"Pog_Ariana" 2000 55.303
"Pog_Ariana" 2001  55.43
"Pog_Ariana" 2002 55.978
"Pog_Ariana" 2003 56.034
"Pog_Ariana" 2004 55.616
"Pog_Ariana" 2005 55.552
"Pog_Ariana" 2006 55.796
"Pog_Ariana" 2007 55.925
"Pog_Ariana" 2008 57.046
"Pog_Ariana" 2009 57.684
"Pog_Ariana" 2010 57.847
"Pog_Ariana" 2011 58.432
"Pog_Ariana" 2012 59.118
"Pog_Ariana" 2013 59.716
"Pog_Ariana" 2014 59.805
"Pog_Ariana" 2015 60.157
"Pog_Ariana" 2016 60.294
"Pog_Ariana" 2017 60.687
"Pog_Ariana" 2018 60.913
"Pog_Ariana" 2019 61.043
"Pog_Ariana" 2020 61.141
"Pog_Ariana" 2021 61.382
"Pog_Ariana" 2022 61.224
"Pog_Ariana" 2023 60.971
"Pog_Ariana" 2024 61.365
"Pog_Ariana" 2025 61.225
"Pog_Ariana" 2026 61.413
"Pog_Ariana" 2027 61.623
"Pog_Ariana" 2028 61.204
"Pog_Ariana" 2029 61.612
"Pog_Ariana" 2030 61.575
"Pog_Ariana" 2031 61.422
"Pog_Ariana" 2032 61.373
"Pog_Ariana" 2033  61.22
"Pog_Ariana" 2034 61.392
"Pog_Ariana" 2035 61.298
"Pog_Ariana" 2036 61.844
"Pog_Ariana" 2037 61.734
"Pog_Ariana" 2038 62.245
"Pog_Ariana" 2039 62.538
"Pog_Ariana" 2040 62.266
"Pog_Ariana" 2041 62.395
"Pog_Ariana" 2042 61.972
"Pog_Ariana" 2043 62.477
"Pog_Ariana" 2044 62.891
"Pog_Ariana" 2045 63.642
"Pog_Ariana" 2046  63.99
"Pog_Ariana" 2047 63.607
"Pog_Ariana" 2048 63.889
"Pog_Ariana" 2049  63.26
"Pog_Ariana" 2050  63.22
"Pog_Ariana" 2051 63.176
"Pog_Ariana" 2052 63.148
"Pog_Ariana" 2053 63.178
"Pog_Ariana" 2054  63.26
"Pog_Ariana" 2055 63.319
"Pog_Ariana" 2056 63.387
"Pog_Ariana" 2057 63.866
"Pog_Ariana" 2058 63.603
"Pog_Ariana" 2059 63.137
"Pog_Ariana" 2060 63.503
"Pog_Ariana" 2061 64.113
"Pog_Ariana" 2062  64.98
"Pog_Ariana" 2063 64.831
"Pog_Ariana" 2064 64.794
"Pog_Ariana" 2065 64.827
"Pog_Ariana" 2066 64.873
"Pog_Ariana" 2067 65.544
"Pog_Ariana" 2068 65.852
"Pog_Ariana" 2069 65.862
"Pog_Ariana" 2070 66.288
"Pog_Ariana" 2071 66.301
"Pog_Ariana" 2072 66.478
"Pog_Ariana" 2073 66.622
"Pog_Ariana" 2074 66.786
"Pog_Ariana" 2075 66.534
"Pog_Ariana" 2076 66.666
"Pog_Ariana" 2077 66.317
"Pog_Ariana" 2078 66.593
"Pog_Ariana" 2079 67.113
"Pog_Ariana" 2080 67.785
"Pog_Ariana" 2081 67.488
"Pog_Ariana" 2082 67.445
"Pog_Ariana" 2083 67.624
"Pog_Ariana" 2084 67.988
"Pog_Ariana" 2085 68.134
"Pog_Beja"   1990 52.289
"Pog_Beja"   1991 52.549
"Pog_Beja"   1992 52.294
"Pog_Beja"   1993 51.816
end

The first variable is a string variable holding the governorates of a country, the second variable is a year variable, and the third one is the number of consecutive dry days per governorate per year.
My goal is to re-organize this data set in order to have the governorates as my x variable, and the time variable "Year" as the Y variable, as well as to delete the "Pog_" prefix at the beginning of the governorate names.

Any help please? I'll be very grateful.

Thanks!

Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

10 Nov 2023, 13:54

Code:

replace governorate = subinstr(governorate, "Pog_", "", 1)

will clean up your governorate variable.

As for "My goal is to re-organize this data set in order to have the governorates as my x variable, and the time variable "Year" as the Y variable..." I do not know what you mean, as you provide no context for the type of analysis you plan to do. Suffice it to say, however, that if you are planning a regression analysis with Year as the outcome variable and governorate as a regressor, then the only reorganization needed is to convert governorate from a string variable to an integer-valued numeric variable with value labels:

Code:

encode governorate, gen(n_governorate)
drop governorate
rename n_governorate governorate

Then you would be able to do something like -regress year i.governorate- (the variable is named year, in lower case, in your data).

That said, it is odd to have Year as an outcome variable in a model. Even odder, in your case, is that your values of Year extend far into the future. How is that even possible?

Aziz Essouaied

Join Date: Apr 2020

Posts: 203
#3

10 Nov 2023, 23:44

Clyde Schechter Thanks for the help!

My goal is to do some mapping after reorganizing the dataset. I'm planning to create maps (about 5 of them) to show the evolution of the variable "consecutivedrydays" by decade. Is there a way to organize the data that would help with that?

Thanks for the help!

Aziz Essouaied

Join Date: Apr 2020

Posts: 203
#4

10 Nov 2023, 23:52

Clyde Schechter The variable that I want to study is "consecutivedrydays", so I want to organize the dataset so that the "governorate" value is not repeated for each year. The variable to be explained will be indexed by the "year" and "governorate" variables.

Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#5

11 Nov 2023, 11:08

I'm not the best person to answer this question as I don't do this kind of mapping. That means I don't know what organization is best for the command(s) that do that. Suffice it to say that you have a long data layout which is the best arrangement for most Stata commands.

There are two possible wide layouts that are alternatives for this. In one, there would be a single observation for each governorate, and the dry days values for each year would be in separate variables. In the other, there would be a single observation for each year, and the dry days values for each governorate would be in separate variables. For most Stata commands this organization makes life difficult or impossible, but there are some for which it is optimal. I don't know which organization will work better for you.

Starting from your original data (don't run the code offered in #2), the following code shows how. DON'T try to run both methods. Pick one or the other only.

Code:

replace governorate = subinstr(governorate, "Pog_", "", 1)
rename consecutivedrydays drydays

// METHOD 1: TO GET ONE OBSERVATION PER GOVERNORATE
reshape wide drydays, i(governorate) j(year)

// METHOD 2: TO GET ONE OBSERVATION PER YEAR
replace governorate = substr(strtoname(governorate), 1, 26)
reshape wide drydays, i(year) j(governorate) string
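
For readers who want to inspect both layouts before choosing one, here is a minimal self-contained sketch on a six-row extract of the example data (two governorates, three years). It wraps each reshape in -preserve- and -restore-, which is my own addition, so that each method starts from the same long data and neither is run on top of the other.

```stata
clear
input str13 governorate int year float consecutivedrydays
"Pog_Ariana" 1990 55.416
"Pog_Ariana" 1991 55.611
"Pog_Ariana" 1992 55.498
"Pog_Beja"   1990 52.289
"Pog_Beja"   1991 52.549
"Pog_Beja"   1992 52.294
end

replace governorate = subinstr(governorate, "Pog_", "", 1)
rename consecutivedrydays drydays

// Layout 1: one observation per governorate, one variable per year
preserve
reshape wide drydays, i(governorate) j(year)
describe, simple    // governorate drydays1990 drydays1991 drydays1992
list, noobs
restore

// Layout 2: one observation per year, one variable per governorate
preserve
replace governorate = substr(strtoname(governorate), 1, 26)
reshape wide drydays, i(year) j(governorate) string
describe, simple    // year drydaysAriana drydaysBeja
list, noobs
restore
```

After the final -restore-, the data in memory are back in the long layout, so either method can then be applied for real.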
Aziz Essouaied

Join Date: Apr 2020

Posts: 203
#6

11 Nov 2023, 23:43

Clyde Schechter

I've tried both methods, and I believe the first method is the more suitable one for the job.

The mapping tool I intend to use is the "spmap" package. There are 24 governorates, and I'm planning to draw a series of maps by decade (9 maps, since there are 9 decades from 1990 to 2085), so I think the first method is the one that will help with that.

The work would then be to combine this first dataset with the dataset for the mapping (the "shape" file), and I think the first method can help with that.

Thanks for the help!
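
One possible way to prepare the data for decade maps, sketched under assumptions: average the dry days within each decade, then reshape to one observation per governorate as in Method 1 of #5. Summarizing a decade by its mean and the variable name decade are my own illustration; nothing in the thread settles how a decade should be summarized. The file names tunisia_db and tunisia_coord, and the variables governorate and id in the attribute file, are hypothetical, so the merge and map lines are left as comments.

```stata
clear
input str13 governorate int year float consecutivedrydays
"Pog_Ariana" 1990 55.416
"Pog_Ariana" 1999 55.035
"Pog_Ariana" 2000 55.303
"Pog_Ariana" 2009 57.684
"Pog_Beja"   1990 52.289
"Pog_Beja"   1993 51.816
end

replace governorate = subinstr(governorate, "Pog_", "", 1)
rename consecutivedrydays drydays

// Decade start year: 1990-1999 -> 1990, 2000-2009 -> 2000, and so on
generate int decade = 10 * floor(year / 10)

// Mean dry days per governorate and decade (the mean is an assumption;
// another summary statistic could be substituted in -collapse-)
collapse (mean) drydays, by(governorate decade)

// One observation per governorate: drydays1990, drydays2000, ...
reshape wide drydays, i(governorate) j(decade)
list, noobs

// Hypothetical next steps, assuming the shapefile was converted with -shp2dta-
// into tunisia_db.dta (attributes, containing governorate and id) and
// tunisia_coord.dta (coordinates):
// merge 1:1 governorate using tunisia_db
// spmap drydays1990 using tunisia_coord, id(id) fcolor(Blues)
```

With the full data, this binning gives ten groups (1990s through 2080s), the last covering only 2080 to 2085, so the final map would rest on six years rather than ten.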