Filtering a Dataset by a second dataset

Stephan Molekski

Join Date: May 2019

Posts: 7
#1

Filtering a Dataset by a second dataset

29 May 2019, 09:00

Hello everyone,

Let me preface that I am completely new to Stata. My problem is the following and I can't find a solution with google, so I hope you can help me

I have one Dataset that looks like the following:

ID // Name // Value // ...(more variables)

1 xyz 100
2 abc 120
...

(about 4 million rows)

The thing is I don't need all the rows only about 5000 of them and I have an Excel sheet with the IDs that I actually need. It is just one Column, like:

ID
1
2
3
...

I tried to export the dataset to excel and do it there, but it has too many rows for excel, so is there a way to get rid of the rows with IDs that are not in the Excel and keep the rest?

Thank you very much!
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4485
#2

29 May 2019, 09:19

import the excel sheet into Stata (using -import excel-); make sure that both files are sorted on ID and then use -merge-;

Code:

help import excel help sort help merge

example code (please read the FAQ on showing data examples that others can use to help you):

Code:

merge 1:1 ID using idfile

this assumes that what is now your main Stata data set is in memory; it also assumes that ID's occur only once in each file
1 like
Comment
Stephan Molekski

Join Date: May 2019

Posts: 7
#3

29 May 2019, 10:05

Thank you!

But the first, long file that i want to reduce has a lot of rows with same IDs. But they only occur once in the excel list. Would that change something?

Last edited by Stephan Molekski; 29 May 2019, 10:10.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4485
#4

29 May 2019, 10:29

assuming you want to match each of them, change the 1:1 part of the command to m:1
1 like
Comment
Stephan Molekski

Join Date: May 2019

Posts: 7
#5

29 May 2019, 10:44

Ok thank you very much!!
Comment
Stephan Molekski

Join Date: May 2019

Posts: 7
#6

29 May 2019, 12:36

I tried it out and ran into a problem. Although the final data set now only consists of IDs that are also in the Excel, there is in every case only one row per ID, where before multiple rows with the same ID were. Any idea how to keep multiple rows with the same idea through the merge?
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4485
#7

29 May 2019, 13:01

please show the exact command you used, in CODE blocks (see the FAQ if you don't understand this); what I suggested would not have had this effect
Comment
Stephan Molekski

Join Date: May 2019

Posts: 7
#8

29 May 2019, 13:31

Ok I created a data set based on the excel, so one column:

ID
1
2
3
...

Then I typed

Code:

merge m:1 ID using maindataset.dta

Since one ID has mostly multiple rows in the maindataset, they all should still be incorporated in the new data set as long as they are in the excel list as well, but the resulting data set has only one row per ID even if there were multiple rows before for that ID.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4485
#9

29 May 2019, 13:36

sorry, I should have been clearer - also show the result (i.e., the tabulation of the _merge variable)
Comment

Stephan Molekski

Join Date: May 2019
Posts: 7

#10

29 May 2019, 14:06

So this is the code:

Code:

  merge m:1 ID using maindataset.dta

And this is the output

Code:

*(1 variable, 2289 observations pasted into data editor)

. merge 1:m id using maindatafile.dta

    Result                           # of obs.
    -----------------------------------------
    not matched                     3,345,852
        from master                       300  (_merge==1)
        from using                  3,345,552  (_merge==2)

    matched                           966,407  (_merge==3)
    -----------------------------------------

.

the ones that occur only in one of the two datasets still have all the rows with the given ID, only the ones that occur in both got reduced to only one row, even though there were multiple ones with that one ID before.

Announcement

Filtering a Dataset by a second dataset

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment