Dropping many observations based on the value of two observations

Nicola Cenacchi

Join Date: Sep 2023

Posts: 17
#1

Dropping many observations based on the value of two observations

01 Nov 2023, 10:12

Hi everyone,

I'd like to describe the data I am working with and then ask the question further below:

I am working with geographical pixels. I have thousands of pixels, each representing a square of X by X kilometers on a map.

Each pixel is identified by a unique ID number, and it has several different pieces of data associated to it.
In my case each pixel can contain several land cover types, each identified by a specific name and with a value of area associated with it. For instance I have area of land cultivated under corn, rice, or wheat. I have area of forest, savanna, brush etc. So, basically each pixel has a combination of both agriculture and natural habitats.
For each pixel I have also calculated the total agricultural land (called totcrop) by simply summing up the area for each single crop.
In addition, for each pixel, the areas of these land covers are provided for the year 2005 and for the year 2050. It so happens that the year 2005 also corresponds to a specific climate (which we call "base") and the year 2050 correspond to a set of other climates (they are called "ipsl" as well as other names).

Below you see an excerpt of my data. The variables, from left to right are: IDpixel (the numeric code that identifies the pixel), year, climate, agfuture (this is a simulated agricultural future, not relevant to my question further below), database (again not specifically relevant to my question), landcover (as I was saying above: corn, wheat, barren land, tundra, broadleaf forest etc) and value of area.
I sorted the data by IDpixel, landcover and year to show you that inside each single pixel, I have two area values for each land cover (barren, broad, grass, irmaize, irwheat, totcrop etc) one for 2005 and one for 2050 (the year variable).

This is the problem I have not been able to solve.

Let's consider a single pixel. When the value of area for the entire agricultural area (called totcrop) is 0 both in 2005 and in 2050, I want all data for that pixel to be dropped from the dataset.
I will state it again in slightly different words in the hope of making myself clear: for each pixel I have data on area for different crops, as well as a value for total agricultural area, and different natural habitats - and I have these data for the year 2005 and for the year 2050.
If, for a specific pixel, the value of area for total agricultural area is 0 both in 2005 and in 2050, I want that pixel and all data associated with it to be erased from the dataset.

I have not been able to find a solution.

any help/ideas are welcome
Many thanks
Nicola

* Example generated by -dataex-. For more info, type help dataex
clear
input str6 IDpixel str4(year climate agfuture) str3 database str7 landcover float area
"72119" "2005" "base" "midd" "glc" "barren" 25896
"72119" "2050" "ipsl" "midd" "glc" "barren" 26794
"72119" "2005" "base" "midd" "glc" "broad" 19
"72119" "2050" "ipsl" "midd" "glc" "broad" 10
"72119" "2005" "base" "midd" "glc" "grass" 102705
"72119" "2050" "ipsl" "midd" "glc" "grass" 104611
"72119" "2005" "base" "midd" "glc" "irmaize" 0
"72119" "2050" "ipsl" "midd" "glc" "irmaize" 0
"72119" "2005" "base" "midd" "glc" "irother" 6517
"72119" "2050" "ipsl" "midd" "glc" "irother" 6517
"72119" "2005" "base" "midd" "glc" "irrice" 0
"72119" "2050" "ipsl" "midd" "glc" "irrice" 0
"72119" "2005" "base" "midd" "glc" "irsorg" 0
"72119" "2050" "ipsl" "midd" "glc" "irsorg" 0
"72119" "2005" "base" "midd" "glc" "irsoy" 0
"72119" "2050" "ipsl" "midd" "glc" "irsoy" 0
"72119" "2005" "base" "midd" "glc" "irwheat" 0
"72119" "2050" "ipsl" "midd" "glc" "irwheat" 0
"72119" "2005" "base" "midd" "glc" "mixed" 4493
"72119" "2050" "ipsl" "midd" "glc" "mixed" 1910
"72119" "2005" "ipsl" "midd" "glc" "totcrop" 4500
"72119" "2050" "ipsl" "midd" "glc" "totcrop" 6517
"72119" "2005" "base" "midd" "glc" "tundra" 117
"72119" "2050" "ipsl" "midd" "glc" "tundra" 9
"72119" "2005" "base" "midd" "glc" "woody" 888
"72119" "2050" "ipsl" "midd" "glc" "woody" 496
"72120" "2005" "base" "midd" "glc" "barren" 18813
"72120" "2050" "ipsl" "midd" "glc" "barren" 19757
"72120" "2005" "base" "midd" "glc" "broad" 17
"72120" "2050" "ipsl" "midd" "glc" "broad" 9
"72120" "2005" "base" "midd" "glc" "grass" 83772
"72120" "2050" "ipsl" "midd" "glc" "grass" 86198
"72120" "2005" "base" "midd" "glc" "irmaize" 0
"72120" "2050" "ipsl" "midd" "glc" "irmaize" 0
"72120" "2005" "base" "midd" "glc" "irother" 44677
"72120" "2050" "ipsl" "midd" "glc" "irother" 44677
"72120" "2005" "base" "midd" "glc" "irrice" 0
"72120" "2050" "ipsl" "midd" "glc" "irrice" 0
"72120" "2005" "base" "midd" "glc" "irsorg" 0
"72120" "2050" "ipsl" "midd" "glc" "irsorg" 0
"72120" "2005" "base" "midd" "glc" "irsoy" 0
"72120" "2050" "ipsl" "midd" "glc" "irsoy" 0
end
[/CODE]
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30150
#2

01 Nov 2023, 10:49

First, thank you for using -dataex- to show example data on your very first post here!

That said, I don't know how to relate your description to the example data itself. There is no variable called totcrop. Instead, totcrop is one possible value of the variable landcover. You said that totcrop is calculated by adding up the areas of all crops. There is no variable called crop. Perhaps landcover is what you meant? But when I add up the areas of all the different landcovers (other than totcrop itself) for each IDpixel in each year, it does not match the value of the area shown in the observation where landcover == "totcrop". Moreover, IDpixel 72120 doesn't even have any observations where landcover == "totcrop".

I'm sure that the solution to your problem is actually quite simple if we can clarify how we can find this totcrop number you refer to in your description in the data, but I just don't see how to do it with the example data.

Last edited by Clyde Schechter; 01 Nov 2023, 10:58.
Comment
Nicola Cenacchi

Join Date: Sep 2023

Posts: 17
#3

01 Nov 2023, 14:23

Thank you very much for replying Clyde.
All you wrote is correct. I will learn to be clearer in my exposition.

First of all, yes you are correct, totcrop is one of the possible values of landcover.
In regards to your other observations. Let me explain what I did, to see if it is enough context.
My dataset has thousands of observations, so I extracted a few here and there as an example.
IDpixel 72120 will have a totcrop too. It is not a given that all pixels will have area under maize or wheat. One of them may have area under maize in 2050 but not in 2005, another may not have maize at all. But all of them will have totcrop as a value of landcover, because I calculated it with whatever crop areas they had available (in a previous step of my data processing).

I understand that if you do have to use the data I pasted, it makes little sense for me not to have added totcrop to all pixels. Apologies again.
I pasted below a modified data, which now has totcrop for both the example pixels I am using. Because this makes sense in relation to my question, I made sure to give totcrop value of 0 both in 2005 and 2050 to the 72120 pixel.

Also, as I mentioned, these data are extracted almost randomly from my large dataset, so if you sum the single crop areas I doubt you would find that the calculation matches the totcrop. I did that because in my mind the way totcrop is generated did not seem to have a bearing on finding a solution to my question.... but I may be wrong .

My question, applied to the (now more complete) data below is this:
IDpixels 72120 has value of 0 for totcrop both in 2005 and 2050. Because this condition is satisfied, I want all the data relative to 72120 to be erased from my dataset.
Applying this to my very large dataset, it means that I want all pixels where totcrop is 0 both in 2005 and 2050 to be eliminated from the dataset.
To further explain ... what I am trying to do is filter my large dataset so that the only pixels that are preserved are those that have a non-zero amount of agriculture both in 2005 and in 2050.

I hope this is a little more clear, but happy to learn and explain myself further.
thank you very much again
Nicola

* Example generated by -dataex-. For more info, type help dataex
clear
input str6 IDpixel str4(year climate agfuture) str3 database str7 landcover float area
"72119" "2005" "base" "midd" "glc" "barren" 25896
"72119" "2050" "ipsl" "midd" "glc" "barren" 26794
"72119" "2005" "base" "midd" "glc" "broad" 19
"72119" "2050" "ipsl" "midd" "glc" "broad" 10
"72119" "2005" "base" "midd" "glc" "grass" 102705
"72119" "2050" "ipsl" "midd" "glc" "grass" 104611
"72119" "2005" "base" "midd" "glc" "irmaize" 0
"72119" "2050" "ipsl" "midd" "glc" "irmaize" 0
"72119" "2005" "base" "midd" "glc" "irother" 6517
"72119" "2050" "ipsl" "midd" "glc" "irother" 6517
"72119" "2005" "base" "midd" "glc" "irrice" 0
"72119" "2050" "ipsl" "midd" "glc" "irrice" 0
"72119" "2005" "base" "midd" "glc" "irsorg" 0
"72119" "2050" "ipsl" "midd" "glc" "irsorg" 0
"72119" "2005" "base" "midd" "glc" "irsoy" 0
"72119" "2050" "ipsl" "midd" "glc" "irsoy" 0
"72119" "2005" "base" "midd" "glc" "irwheat" 0
"72119" "2050" "ipsl" "midd" "glc" "irwheat" 0
"72119" "2005" "base" "midd" "glc" "mixed" 4493
"72119" "2050" "ipsl" "midd" "glc" "mixed" 1910
"72119" "2005" "ipsl" "midd" "glc" "totcrop" 4500
"72119" "2050" "ipsl" "midd" "glc" "totcrop" 6517
"72119" "2005" "base" "midd" "glc" "tundra" 117
"72119" "2050" "ipsl" "midd" "glc" "tundra" 9
"72119" "2005" "base" "midd" "glc" "woody" 888
"72119" "2050" "ipsl" "midd" "glc" "woody" 496
"72120" "2005" "base" "midd" "glc" "barren" 18813
"72120" "2050" "ipsl" "midd" "glc" "barren" 19757
"72120" "2005" "base" "midd" "glc" "broad" 17
"72120" "2050" "ipsl" "midd" "glc" "broad" 9
"72120" "2005" "base" "midd" "glc" "grass" 83772
"72120" "2050" "ipsl" "midd" "glc" "grass" 86198
"72120" "2005" "base" "midd" "glc" "irmaize" 0
"72120" "2050" "ipsl" "midd" "glc" "irmaize" 0
"72120" "2005" "base" "midd" "glc" "irother" 44677
"72120" "2050" "ipsl" "midd" "glc" "irother" 44677
"72120" "2005" "base" "midd" "glc" "irrice" 0
"72120" "2050" "ipsl" "midd" "glc" "irrice" 0
"72120" "2005" "base" "midd" "glc" "irsorg" 0
"72120" "2050" "ipsl" "midd" "glc" "irsorg" 0
"72120" "2005" "base" "midd" "glc" "irsoy" 0
"72120" "2050" "ipsl" "midd" "glc" "irsoy" 0
"72120" "2005" "ipsl" "midd" "glc" "totcrop" 0
"72120" "2050" "ipsl" "midd" "glc" "totcrop" 0
end
[/CODE]
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30150
#4

01 Nov 2023, 14:56

So I will make some assumptions about your data, which hold in the example. There are ways to do what you ask even if these assumptions are false, but the code is simplest if they are true. And they do seem to me to be natural assumptions:
The only values of year in the data set are 2005 and 2050.

Every IDPixel has a landcover == "totcrop" observation for both 2005 and 2050.

These are verified in the first two commands shown. If either gives an error message, do not proceed with the rest of the code. Just post back explaining and I can give some more complicated code that will work without those assumptions.

Code:

assert inlist(year, "2005", "2050") // ONLY YEARS ARE 2005, 2050 by IDpixel (year landcover), sort: assert year[1] != year[_N] // EACH IDpixel HAS BOTH YEARS' DATA by IDpixel year, sort: egen totcrop_area = max(cond(landcover == "totcrop", area, .)) by IDpixel (year): drop if totcrop_area[1] == 0 & totcrop_area[_N] == 0

As an aside, I'd like to advise you about your data organization. While it may not turn out to be a problem, it is unusual, and usually a bad idea, to have a year variable as a string. I would -destring year, replace- and go with the numeric year. It makes coding simpler because you can drop the quotes when referring to any particular year. It also saves memory, though that probably doesn't really matter much from a practical perspective. You might consider doing the same thing with IDpixel--though the advantages there are fewer since one almost never does calculations on IDs, and it is relatively uncommon to refer to individual values of an ID in code.

More important: I never would have created those observations with landcover == "totcrop" in the first place. While things like that are common when using spreadsheets, Stata is not a spreadsheet, and trying to work with it as if it were usually ends in tears. In particular, should the need arise to do any further calculations on the area variable, you will need to explicitly exclude the landcover == "totcrop" observations from the calculations, so you don't double count everything. Well, humans being humans, somewhere along the line you're likely to forget to do that, and then your results are wrong. And you might or might not notice that they are wrong, and so you might then share those results with others who rely on them to their detriment. What I would have done is create a separate totcrop variable: whatever sum of areas it is you calculated for those observations, you can calculate that as a Stata variable in its own right--that way it will never get accidentally thrown in with the individual areas when you perform calculations with the area variable. Had you done that, the last two lines of code I show above would have been reduced to a one-liner: -by IDpixel (year): drop if totcrop[1] == 0 & totcrop[_N] == 0-
Comment
Nicola Cenacchi

Join Date: Sep 2023

Posts: 17
#5

02 Nov 2023, 08:40

thank you very much Clyde,

the code worked. I am surely going to learn quite a bit from studying your four lines of code.

Regarding the year. Thank you for that.

About the use of totcrop as a variable. I am doing a lot of shaping and reshaping of my dataset, as I need looping across multiple files, calculating entropy indicators etc. At one point during the process my data was in wide format and each landcover value was actually a variable. So, totcrop was a variable too.
Are you suggesting that I could have only totcrop as a separate variable? I have difficulties picturing something different from having the dataset in wide form, as I mentioned (i.e., each single value of landcover as a separate variable). It would not be as easy for me to work with that.

In any case I completely understand the danger you are pointing out, if I keep the data in long-form. I will need to give it some thought as I proceed with my calculations.

thank you very much for taking the time to help me. I have quite a bit to learn and think about.
Nicola
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30150
#6

02 Nov 2023, 10:26

Are you suggesting that I could have only totcrop as a separate variable?

Yes, I am strongly suggesting that. It is not a good idea to have a separate observation which is the total (nor means, nor any other expression calculated from) other observations. Such "summary statistics" should be separate variables, for the reasons I gave in #4.

I have difficulties picturing something different from having the dataset in wide form, as I mentioned (i.e., each single value of landcover as a separate variable).

There are many situations where the choice between long and wide is clear, such as panel data. But your situation isn't one of them. It really depends on what analyses you will be doing and the role that the different crops play in these analyses. If the area allocated to each crop is ultimately going to be used as a variable in some regression analysis, then each crop will need to be a separate variable, i.e. a wide layout. If, however, relationships among the crops' allocated areas will be used in data management, as in the problem posed in this thread, then a long layout (with total crop as a separate variable) is the optimal arrangement. Perhaps you will do some data management in long layout, and, near the end, -reshape- to wide so that each crop can be a predictor in a regression model.

I would say that when you are unsure what to do, the long layout is more often going to be the right one. The major exception to that is when you are displaying the data in graphs or planning to export to a spreadsheet. For those purposes a wide layout is often preferable. There are also a few analysis commands such that require the data to be wide. But these commands are infrequently used in contemporary data analysis, and some of them can be emulated by other commands that require long layout. But if you have your data wide and you find yourself writing loops over a list of variables that are all attributes of the unit of analysis to do comparisons or summary calculations, you should at least stop and ask yourself whether a long layout might be better.
Comment
Nicola Cenacchi

Join Date: Sep 2023

Posts: 17
#7

02 Nov 2023, 14:20

thank you very much again for all these suggestions Clyde. Much appreciated.

Best,
Nicola
Comment

Announcement

Dropping many observations based on the value of two observations

Comment

Comment

Comment

Comment

Comment

Comment