Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge for multiple variables

    I have two data sets for publication and author ID. In the master file, I have the names and author IDs of individuals of my interest. In the second file, I have teh scopus data. In this dataset multiple authors appear in the same publication, and hence, I split teh file. Now against each publication, I have multiple authors and their corrsponding author IDs (authorid1, authorid2, authorid3. etc.). I want to merge the data set using teh first one.

    However, in the stata command it says, merge works only when both the files have same variable. Can somebody guide me on how to merge the two dataset?
    I am posting the sample of the data below.

    Dataset 1:
    Author AuthorID
    A. Jacquel 6507000366
    D. Bichet 7005292728
    D. Hérouart 6603700493
    J. Roux 7102297258
    J-F. Tanti 7003952333
    L. Buée 57203025864
    L. Mailleret 6506838574
    L. Ruel 6602839019
    L. Stemman 56133059000
    M. Mantegazza 6701735539

    Data Set 2:
    Authors authorsid001 authorsid002 authorsid003 authorsid004 authorsid005
    Aach J.; Bulyk M.L.; Church G.M.; Comander J.; Derti A.; Shendure J. 6603035323 7005451318 6603250535 6507643009 6602548264
    Aalten P.; Verhey F.R.J.; Boziki M.; Brugnolo A.; Bullock R.; Byrne E.J.; Camus V.; Caputo M.; Collins D.; De Deyn P.P.; Elina K.; Frisoni G.; Holmes C.; Hurt C.; Marriott A.; Mecocci P.; Nobili F.; Ousset P.J.; Reynish E.; Salmon E.; Tsolaki M.; Vellas B.; Robert P.H. 7007057244 15064974500 16306634100 7102712188 35399907100
    Aalten P.; Verhey F.R.J.; Boziki M.; Bullock R.; Byrne E.J.; Camus V.; Caputo M.; Collins D.; De Deyn P.P.; Elina K.; Frisoni G.; Girtler N.; Holmes C.; Hurt C.; Marriott A.; Mecocci P.; Nobili F.; Ousset P.J.; Reynish E.; Salmon E.; Tsolaki M.; Vellas B.; Robert P.H. 7007057244 15064974500 7102712188 35399907100 7003839024
    Aasheim H.-C.; Pedeutour F.; Grosgeorge J.; Logtenberg T. 7003433292 6603543905 56219430800
    Aasheim H.-C.; Pedeutour F.; Smeland E.B. 7003433292 35397579900
    Abad P.
    Abad P.; Cardin L.; Poupet A.; Ponchet M. 6602388443 6602779359 6603757981
    Abad P.; Castagnone-Sereno P.; Rosso M.-N.; de Almeida Engler J.; Favery B. 55931175400 7005224485 7003408464 6507058549

    In the second data file, I have authorid values ranging from 1-1046 in the same publication.

  • #2
    In the second data, why are the number of author IDs disagreeing with the number of authors in the first column?

    Also, please review the FAQ (https://www.statalist.org/forums/help) section 12 on how to use dataex command to post sample data.

    Comment


    • #3
      As I had mentioned, there are about 1046 rows of data for author ID. Not all are equal. For example, some publications may have only 2 or 3 authors where as some have hundreds of authors.
      Above data is only a small subset of my dataset. It was difficult for me to copy the entire 1046 rows here and hence, it might seem like there is a mismatch. However, in the complete data set, there is no issues.

      Comment


      • #4
        I am not asking why some publications have more authors and some fewer. I am asking why a publication with 4 authors only has 3 author ID. Like in the case of "Abad P.; Cardin L.; Poupet A.; Ponchet M.", there were only three IDs: "6602388443 6602779359 6603757981".

        Comment


        • #5
          Thanks Ken Chui . I get your point. That is the data from scopus. Might be some issues with downloading. I will have to look into it. However, I will be thankful if you have any insights on how I can go about with the merge function in such a case.

          Comment


          • #6
            I don't have any confident solution because the data are sensible to me. But if you meant how to reshape the wide form to long in data 2, there is a possible way:

            Code:
            clear
            input str100 Authors     str15(authorsid001     authorsid002     authorsid003     authorsid004     authorsid005)
            "Abad P.; Cardin L.; Poupet A.; Ponchet M."     6602388443     6602779359     6603757981         9999999999 .
            "Abad P.; Castagnone-Sereno P.; Rosso M.-N.; de Almeida Engler J.; Favery B."     55931175400     7005224485     7003408464     6507058549     8888888888
            end
            
            gen id = _n
            
            local ct = 1
            foreach x of varlist authorsid001 - authorsid005{
                rename `x' authorsid`ct'
                local ct = `ct' + 1
            }
            
            split Authors, gen(author_name) parse(;)
            drop Authors
            
            reshape long authorsid author_name, i(id) j(index)
            Maybe with that you can do something with the data 1. I am not sure what you meant by "merging" because it does not make sense to me to merge (aka add columns) authors and id to another data that also has authors and id. Maybe it was meant to be append, I am not sure.

            It'd be great to show a final data to show us the wanted data set. But you need to fix that data 2 first.

            Comment


            • #7
              Thank you Ken Chui . These are two different data sets. In one I have the publication details like author, author IDs (like datset 2) title, year, affiliation etc.... In the second dataset, I have the details of the author and author Id alone. I have a third dataset, which has citation details of every publication i.e. annual citations received by a given publication for a duration of 40 years. For the third dataset, I have only the title. My idea was to match the dataset 1 and datset2, thereby generating a dataset which will have title and author ID and then merge ot with the third dataset, as I need to have total cittaion received by the author per year.

              I tried the way you suggested to reshape the data. However, the data is too large and it returns error.

              Comment

              Working...
              X