Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Filtering a Dataset by a second dataset

    Hello everyone,

    Let me preface that I am completely new to Stata. My problem is the following and I can't find a solution with google, so I hope you can help me

    I have one Dataset that looks like the following:

    ID // Name // Value // ...(more variables)

    1 xyz 100
    2 abc 120
    ...

    (about 4 million rows)


    The thing is I don't need all the rows only about 5000 of them and I have an Excel sheet with the IDs that I actually need. It is just one Column, like:

    ID
    1
    2
    3
    ...

    I tried to export the dataset to excel and do it there, but it has too many rows for excel, so is there a way to get rid of the rows with IDs that are not in the Excel and keep the rest?

    Thank you very much!

  • #2
    import the excel sheet into Stata (using -import excel-); make sure that both files are sorted on ID and then use -merge-;
    Code:
    help import excel
    help sort
    help merge
    example code (please read the FAQ on showing data examples that others can use to help you):
    Code:
    merge 1:1 ID using idfile
    this assumes that what is now your main Stata data set is in memory; it also assumes that ID's occur only once in each file

    Comment


    • #3
      Thank you!

      But the first, long file that i want to reduce has a lot of rows with same IDs. But they only occur once in the excel list. Would that change something?
      Last edited by Stephan Molekski; 29 May 2019, 10:10.

      Comment


      • #4
        assuming you want to match each of them, change the 1:1 part of the command to m:1

        Comment


        • #5
          Ok thank you very much!!

          Comment


          • #6
            I tried it out and ran into a problem. Although the final data set now only consists of IDs that are also in the Excel, there is in every case only one row per ID, where before multiple rows with the same ID were. Any idea how to keep multiple rows with the same idea through the merge?

            Comment


            • #7
              please show the exact command you used, in CODE blocks (see the FAQ if you don't understand this); what I suggested would not have had this effect

              Comment


              • #8
                Ok I created a data set based on the excel, so one column:

                ID
                1
                2
                3
                ...

                Then I typed

                Code:
                  
                 merge m:1 ID using maindataset.dta
                Since one ID has mostly multiple rows in the maindataset, they all should still be incorporated in the new data set as long as they are in the excel list as well, but the resulting data set has only one row per ID even if there were multiple rows before for that ID.

                Comment


                • #9
                  sorry, I should have been clearer - also show the result (i.e., the tabulation of the _merge variable)

                  Comment


                  • #10
                    So this is the code:

                    Code:
                      merge m:1 ID using maindataset.dta
                    And this is the output

                    Code:
                    *(1 variable, 2289 observations pasted into data editor)
                    
                    . merge 1:m id using maindatafile.dta
                    
                        Result                           # of obs.
                        -----------------------------------------
                        not matched                     3,345,852
                            from master                       300  (_merge==1)
                            from using                  3,345,552  (_merge==2)
                    
                        matched                           966,407  (_merge==3)
                        -----------------------------------------
                    
                    .
                    the ones that occur only in one of the two datasets still have all the rows with the given ID, only the ones that occur in both got reduced to only one row, even though there were multiple ones with that one ID before.

                    Comment

                    Working...
                    X