Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge Datasets with multiple observations for one year

    Hello everybody,

    I am currently trying to merge two datasets. One Dataset (Platforms) contains the annual traffic of a company and the other (test1) contains the executives. In the Platforms dataset I have different years and companies and for each year and company one number of traffic. In the test1 database, I have multiple executives for each year and each company. I want to use the Platforms database as my base and merge the test1 on it so that I get for one year multiple executives but the same number of traffic in each row for that year. I need to do it this way because I am going to merge for each year one different dataset with the executives.

    My idea for the code was the following:
    The variable name for year is "year" and for company is "dp"

    Code:
    use "Platform"
    merge 1:m dp year using "test1.dta"
    Here you can find an example of one company and one year for each of the two datasets.

    Platforms:
    Code:
    year    dp    totalplatact
    2013    airbnb    5
    2014    airbnb    66
    2015    airbnb    54
    2016    airbnb    34
    2017    airbnb    21
    2018    airbnb    46
    2019    airbnb    12
    2020    airbnb    8
    test1
    Code:
    person_name    dp    year
    Erin Coffman    airbnb    2020
    Harrison Shoff    airbnb    2020
    Brian Chesky    airbnb    2020
    Scott Raymond    airbnb    2020
    Damir Duskan    airbnb    2020
    Justin K Chen    airbnb    2020
    MatÃ*as Vives    airbnb    2020
    Jackson Wang    airbnb    2020
    David Leach    airbnb    2020
    Chip Conley    airbnb    2020
    Justin K Chen    airbnb    2020
    Peter Coles    airbnb    2020
    Josh Horowitz    airbnb    2020
    Christopher Lukezic    airbnb    2020
    Peter Coles    airbnb    2020
    Justin K Chen    airbnb    2020
    Joe Gebbia    airbnb    2020
    Dzmitry Bazhko    airbnb    2020
    If I use my code I get for the year 2020 just one executive in the merged data and not all the executives.

    Can anyone help me with the code?

    Thanks in advance and best regards Jana

  • #2
    First, in future do use dataex to type the data so that others can quickly read that into Stata to test their codes.

    Second, I was not able to replicate the error you mentioned, I got all the exe. without problem:

    Code:
    clear
    input str30 person_name  str10  dp    year
    "Erin Coffman"    airbnb    2020
    "Harrison Shoff"    airbnb    2020
    "Brian Chesky"    airbnb    2020
    "Scott Raymond"    airbnb    2020
    "Damir Duskan"    airbnb    2020
    "Justin K Chen"    airbnb    2020
    "MatÃ*as Vives"    airbnb    2020
    "Jackson Wang"    airbnb    2020
    "David Leach"    airbnb    2020
    "Chip Conley"    airbnb    2020
    "Justin K Chen"    airbnb    2020
    "Peter Coles"    airbnb    2020
    "Josh Horowitz"    airbnb    2020
    "Christopher Lukezic"    airbnb    2020
    "Peter Coles"    airbnb    2020
    "Justin K Chen"    airbnb    2020
    "Joe Gebbia"    airbnb    2020
    "Dzmitry Bazhko"    airbnb    2020
    end
    tempfile test1
    save `test1', replace
    
    clear
    input year str10 dp totalplatact
    2013    airbnb    5
    2014    airbnb    66
    2015    airbnb    54
    2016    airbnb    34
    2017    airbnb    21
    2018    airbnb    46
    2019    airbnb    12
    2020    airbnb    8
    end
    
    merge 1:m dp year using `test1'
    
    list, sepby(year)
    Resutls:

    Code:
         +------------------------------------------------------------------+
         | year       dp   totalp~t           person_name            _merge |
         |------------------------------------------------------------------|
      1. | 2013   airbnb          5                         Master only (1) |
         |------------------------------------------------------------------|
      2. | 2014   airbnb         66                         Master only (1) |
         |------------------------------------------------------------------|
      3. | 2015   airbnb         54                         Master only (1) |
         |------------------------------------------------------------------|
      4. | 2016   airbnb         34                         Master only (1) |
         |------------------------------------------------------------------|
      5. | 2017   airbnb         21                         Master only (1) |
         |------------------------------------------------------------------|
      6. | 2018   airbnb         46                         Master only (1) |
         |------------------------------------------------------------------|
      7. | 2019   airbnb         12                         Master only (1) |
         |------------------------------------------------------------------|
      8. | 2020   airbnb          8          Erin Coffman       Matched (3) |
      9. | 2020   airbnb          8        Harrison Shoff       Matched (3) |
     10. | 2020   airbnb          8          Brian Chesky       Matched (3) |
     11. | 2020   airbnb          8         Scott Raymond       Matched (3) |
     12. | 2020   airbnb          8          Damir Duskan       Matched (3) |
     13. | 2020   airbnb          8         Justin K Chen       Matched (3) |
     14. | 2020   airbnb          8         MatÃ*as Vives       Matched (3) |
     15. | 2020   airbnb          8          Jackson Wang       Matched (3) |
     16. | 2020   airbnb          8           David Leach       Matched (3) |
     17. | 2020   airbnb          8           Chip Conley       Matched (3) |
     18. | 2020   airbnb          8         Justin K Chen       Matched (3) |
     19. | 2020   airbnb          8           Peter Coles       Matched (3) |
     20. | 2020   airbnb          8         Josh Horowitz       Matched (3) |
     21. | 2020   airbnb          8   Christopher Lukezic       Matched (3) |
     22. | 2020   airbnb          8           Peter Coles       Matched (3) |
     23. | 2020   airbnb          8         Justin K Chen       Matched (3) |
     24. | 2020   airbnb          8            Joe Gebbia       Matched (3) |
     25. | 2020   airbnb          8        Dzmitry Bazhko       Matched (3) |
         +------------------------------------------------------------------+
    Do check if there are any leading or trailing space in the company name variable. For example, "airbnb" will not be merged with "airbnb " or " airbnb". If there are spaces like that, check out help string functions, look for strtrim().
    Last edited by Ken Chui; 25 Dec 2022, 08:15.

    Comment


    • #3
      Your merge command looks appropriate. The problem likely lies in your data.

      Unfortunately, you did not follow the guidance in the Statalist FAQ linked to from the top of every page, as well as from the page shown when creating a new topic, which provides advice on effectively posing your questions, posting data, and sharing Stata output.

      There are many ways your data might be organized that are consistent with your description, and each would require a somewhat different approach.

      In order to get a helpful response, you need to show some example data, using the dataex command to do this. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, dataex is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. dataex will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      When asking for help with code, always show example data. When showing example data, always use the dataex command.

      With that said, here are two guesses that come to mind.

      The variable dp is a string variable, and in one or both of your datasets, the values contain leading or trailing blanks. If this is the case, the remedy is
      Code:
      replace dp = ustrtrim(dp)
      in one or both datasets.

      The variable year, although displayed as numeric digits, is in fact also a string variable, with the same problem as that described for dp, and the same remedy.

      Comment


      • #4
        Thanks for your fast reply!

        From now on I will use the dataex command. Thanks for this tip!

        It is not the space or anything in the data. I am wondering why its not working.

        Here again my data. This time with the dataex command maybe you can find my problem.

        Platform:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str24 dp int year float totalplatact
        "airbnb" 2013    7802261
        "airbnb" 2014   24028452
        "airbnb" 2015   87655648
        "airbnb" 2016  238648768
        "airbnb" 2017  609826560
        "airbnb" 2018 1167289344
        "airbnb" 2019 1448031360
        "airbnb" 2020  290765088
        "10"     2013      15298
        "10"     2014     390814
        end

        test1:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str24 dp float year strL person_name
        "airbnb" 2020 "Erin Coffman"   
        "airbnb" 2020 "Harrison Shoff" 
        "airbnb" 2020 "Brian Chesky"   
        "airbnb" 2020 "Scott Raymond"  
        "airbnb" 2020 "Damir Duskan"   
        "airbnb" 2020 "Justin K Chen"  
        "airbnb" 2020 "MatÃ*as Vives"
        "airbnb" 2020 "Jackson Wang"   
        "airbnb" 2020 "David Leach"    
        "airbnb" 2020 "Chip Conley"    
        end
        I think my code should be the correct solution, but still don't know why its not woking?

        Thanks for your help in advance!
        Best regards Jana

        Comment


        • #5
          Thank you for the example data.

          First of all, if we just repeat the experiment in post #2 we get the same results - the merge works.
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str24 dp int year float totalplatact
          "airbnb" 2013    7802261
          "airbnb" 2014   24028452
          "airbnb" 2015   87655648
          "airbnb" 2016  238648768
          "airbnb" 2017  609826560
          "airbnb" 2018 1167289344
          "airbnb" 2019 1448031360
          "airbnb" 2020  290765088
          "10"     2013      15298
          "10"     2014     390814
          end
          save "~/Downloads/platform.dta", replace
          
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str24 dp float year strL person_name
          "airbnb" 2020 "Erin Coffman"  
          "airbnb" 2020 "Harrison Shoff"
          "airbnb" 2020 "Brian Chesky"  
          "airbnb" 2020 "Scott Raymond"  
          "airbnb" 2020 "Damir Duskan"  
          "airbnb" 2020 "Justin K Chen"  
          "airbnb" 2020 "MatÃ*as Vives"
          "airbnb" 2020 "Jackson Wang"  
          "airbnb" 2020 "David Leach"    
          "airbnb" 2020 "Chip Conley"    
          end
          save "~/Downloads/test1.dta", replace
          
          use "~/Downloads/platform.dta", clear
          merge 1:m dp year using "~/Downloads/test1.dta"
          
          list, clean
          Code:
          . merge 1:m dp year using "~/Downloads/test1.dta"
          (variable year was int, now float to accommodate using data's values)
          
              Result                      Number of obs
              -----------------------------------------
              Not matched                             9
                  from master                         9  (_merge==1)
                  from using                          0  (_merge==2)
          
              Matched                                10  (_merge==3)
              -----------------------------------------
          
          .
          . list, clean
          
                     dp   year   totalp~t      person_name            _merge  
            1.       10   2013      15298                    Master only (1)  
            2.       10   2014     390814                    Master only (1)  
            3.   airbnb   2013    7802261                    Master only (1)  
            4.   airbnb   2014   2.40e+07                    Master only (1)  
            5.   airbnb   2015   8.77e+07                    Master only (1)  
            6.   airbnb   2016   2.39e+08                    Master only (1)  
            7.   airbnb   2017   6.10e+08                    Master only (1)  
            8.   airbnb   2018   1.17e+09                    Master only (1)  
            9.   airbnb   2019   1.45e+09                    Master only (1)  
           10.   airbnb   2020   2.91e+08     Erin Coffman       Matched (3)  
           11.   airbnb   2020   2.91e+08   Harrison Shoff       Matched (3)  
           12.   airbnb   2020   2.91e+08     Brian Chesky       Matched (3)  
           13.   airbnb   2020   2.91e+08    Scott Raymond       Matched (3)  
           14.   airbnb   2020   2.91e+08     Damir Duskan       Matched (3)  
           15.   airbnb   2020   2.91e+08    Justin K Chen       Matched (3)  
           16.   airbnb   2020   2.91e+08    MatÃ*as Vives       Matched (3)  
           17.   airbnb   2020   2.91e+08     Jackson Wang       Matched (3)  
           18.   airbnb   2020   2.91e+08      David Leach       Matched (3)  
           19.   airbnb   2020   2.91e+08      Chip Conley       Matched (3)
          But dataex tells us two odd things about your data.

          The one that's important to the merge is that in your test1 dataset dataex tells us year is stored as a float rather than an integer, even though it looks like an integer. This suggests to me that year was created as the result of calculations, and the values displayed as 2020 are not in fact all precisely equal to 2020, despite how they are displayed by dataex. Here's an example that demonstrates what I mean, but not necessarily how the problem arose in your case.
          Code:
          . set obs 5
          Number of observations (_N) was 0, now 5.
          
          . generate int year = 1999+_n
          
          . generate z = year/10000
          
          . format z %9.4f
          
          . generate year2 = z * 10000
          
          . generate diff2 = year - year2
          
          . count if diff2==0
            3
          
          . generate int year3 = round(z * 10000)
          
          . generate diff3 = year - year3
          
          . count if diff3==0
            5
          
          . list, clean
          
                 year        z   year2       diff2   year3   diff3  
            1.   2000   0.2000    2000           0    2000       0  
            2.   2001   0.2001    2001           0    2001       0  
            3.   2002   0.2002    2002   -.0001221    2002       0  
            4.   2003   0.2003    2003    .0001221    2003       0  
            5.   2004   0.2004    2004           0    2004       0
          Let me admit that dataex in this case fails to give us a precise rendering of year2, but that is to be expected - the computer represents fractions in base 2, and they cannot always be represented precisely in base 10.
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input int year float(z year2 diff2) int year3 float diff3
          2000    .2 2000             0 2000 0
          2001 .2001 2001             0 2001 0
          2002 .2002 2002 -.00012207031 2002 0
          2003 .2003 2003  .00012207031 2003 0
          2004 .2004 2004             0 2004 0
          end
          The problem, though, is that it is not good programming practice to try to exactly match numbers that may contain fractional parts, and that is in this case what merge is forced to do. And dataex can run into precision problems when representing fractional numbers.

          So the bottom line on your merge is that in your test1 dataset you need to recreate year as an int in the way that I created year3 in the example. But better yet you should go back to the process that created that test1 and recode it to correctly create year as an int to begin with.

          And I say that because you also have a problem in your platform dataset that you need to resulve. The dataex output tells us your variable totalplatact is a float, which can store at most a 7-digit integer with full accuracy. But you have 10-digit numbers like 1,167,289,344. Consider the following.
          Code:
          . input float x
          
                       x
            1.  1167289343
            2.  1167289344
            3.  1167289345
            4. end
          
          . format %16.0f x
          
          . list
          
               +------------+
               |          x |
               |------------|
            1. | 1167289344 |
            2. | 1167289344 |
            3. | 1167289344 |
               +------------+
          So who knows what the number now represented as 1167289344 was really supposed to be?

          You need to go back to the process that created totalplatact and create it as a double rather than a float to have the full accuracy you think you have.

          These issues are discussed in the output of help precision.

          Here are the limits on storage of decimal integers with full accuracy in the various numeric storage types. The fixed-point variables lose the 27 largest positive values to missing value codes; the similar loss for floating point variables occurs only for the largest exponent, so it doesn't affect the much smaller integer values.

          byte - 7 bits -127 100
          int - 15 bits -32,767 32,740
          long - 31 bits -2,147,483,647 2,147,483,620
          float - 24 bits -16,777,216 16,777,216
          double - 53 bits -9,007,199,254,740,992 9,007,199,254,740,992
          Last edited by William Lisowski; 26 Dec 2022, 08:33.

          Comment


          • #6
            Hey William,
            thanks a lot for your help!
            You are right, I corrected the data and now my merge works.

            Best regards, Jana

            Comment

            Working...
            X