Merge for multiple variables

Sreenivasa Prabhu

Join Date: Oct 2023
Posts: 8

Merge for multiple variables

16 Nov 2023, 11:02

I have two data sets for publication and author ID. In the master file, I have the names and author IDs of individuals of my interest. In the second file, I have teh scopus data. In this dataset multiple authors appear in the same publication, and hence, I split teh file. Now against each publication, I have multiple authors and their corrsponding author IDs (authorid1, authorid2, authorid3. etc.). I want to merge the data set using teh first one.

However, in the stata command it says, merge works only when both the files have same variable. Can somebody guide me on how to merge the two dataset?
I am posting the sample of the data below.

Dataset 1:

Author	AuthorID
A. Jacquel	6507000366
D. Bichet	7005292728
D. Hérouart	6603700493
J. Roux	7102297258
J-F. Tanti	7003952333
L. Buée	57203025864
L. Mailleret	6506838574
L. Ruel	6602839019
L. Stemman	56133059000
M. Mantegazza	6701735539

Data Set 2:

Authors	authorsid001	authorsid002	authorsid003	authorsid004	authorsid005
Aach J.; Bulyk M.L.; Church G.M.; Comander J.; Derti A.; Shendure J.	6603035323	7005451318	6603250535	6507643009	6602548264
Aalten P.; Verhey F.R.J.; Boziki M.; Brugnolo A.; Bullock R.; Byrne E.J.; Camus V.; Caputo M.; Collins D.; De Deyn P.P.; Elina K.; Frisoni G.; Holmes C.; Hurt C.; Marriott A.; Mecocci P.; Nobili F.; Ousset P.J.; Reynish E.; Salmon E.; Tsolaki M.; Vellas B.; Robert P.H.	7007057244	15064974500	16306634100	7102712188	35399907100
Aalten P.; Verhey F.R.J.; Boziki M.; Bullock R.; Byrne E.J.; Camus V.; Caputo M.; Collins D.; De Deyn P.P.; Elina K.; Frisoni G.; Girtler N.; Holmes C.; Hurt C.; Marriott A.; Mecocci P.; Nobili F.; Ousset P.J.; Reynish E.; Salmon E.; Tsolaki M.; Vellas B.; Robert P.H.	7007057244	15064974500	7102712188	35399907100	7003839024
Aasheim H.-C.; Pedeutour F.; Grosgeorge J.; Logtenberg T.	7003433292	6603543905	56219430800
Aasheim H.-C.; Pedeutour F.; Smeland E.B.	7003433292	35397579900
Abad P.
Abad P.; Cardin L.; Poupet A.; Ponchet M.	6602388443	6602779359	6603757981
Abad P.; Castagnone-Sereno P.; Rosso M.-N.; de Almeida Engler J.; Favery B.	55931175400	7005224485	7003408464	6507058549

In the second data file, I have authorid values ranging from 1-1046 in the same publication.

Tags: None

Ken Chui

Join Date: Aug 2014

Posts: 1058
#2

16 Nov 2023, 11:32

In the second data, why are the number of author IDs disagreeing with the number of authors in the first column?

Also, please review the FAQ (https://www.statalist.org/forums/help) section 12 on how to use dataex command to post sample data.
Comment
Sreenivasa Prabhu

Join Date: Oct 2023

Posts: 8
#3

16 Nov 2023, 11:48

As I had mentioned, there are about 1046 rows of data for author ID. Not all are equal. For example, some publications may have only 2 or 3 authors where as some have hundreds of authors.
Above data is only a small subset of my dataset. It was difficult for me to copy the entire 1046 rows here and hence, it might seem like there is a mismatch. However, in the complete data set, there is no issues.
Comment
Ken Chui

Join Date: Aug 2014

Posts: 1058
#4

16 Nov 2023, 12:19

I am not asking why some publications have more authors and some fewer. I am asking why a publication with 4 authors only has 3 author ID. Like in the case of "Abad P.; Cardin L.; Poupet A.; Ponchet M.", there were only three IDs: "6602388443 6602779359 6603757981".
Comment
Sreenivasa Prabhu

Join Date: Oct 2023

Posts: 8
#5

16 Nov 2023, 12:39

Thanks Ken Chui . I get your point. That is the data from scopus. Might be some issues with downloading. I will have to look into it. However, I will be thankful if you have any insights on how I can go about with the merge function in such a case.
Comment

Ken Chui

Join Date: Aug 2014
Posts: 1058

16 Nov 2023, 14:20

I don't have any confident solution because the data are sensible to me. But if you meant how to reshape the wide form to long in data 2, there is a possible way:

Code:

clear
input str100 Authors     str15(authorsid001     authorsid002     authorsid003     authorsid004     authorsid005)
"Abad P.; Cardin L.; Poupet A.; Ponchet M."     6602388443     6602779359     6603757981         9999999999 .
"Abad P.; Castagnone-Sereno P.; Rosso M.-N.; de Almeida Engler J.; Favery B."     55931175400     7005224485     7003408464     6507058549     8888888888
end

gen id = _n

local ct = 1
foreach x of varlist authorsid001 - authorsid005{
    rename `x' authorsid`ct'
    local ct = `ct' + 1
}

split Authors, gen(author_name) parse(;)
drop Authors

reshape long authorsid author_name, i(id) j(index)

Maybe with that you can do something with the data 1. I am not sure what you meant by "merging" because it does not make sense to me to merge (aka add columns) authors and id to another data that also has authors and id. Maybe it was meant to be append, I am not sure.

It'd be great to show a final data to show us the wanted data set. But you need to fix that data 2 first.

Comment

Sreenivasa Prabhu

Join Date: Oct 2023

Posts: 8
#7

18 Nov 2023, 09:31

Thank you Ken Chui . These are two different data sets. In one I have the publication details like author, author IDs (like datset 2) title, year, affiliation etc.... In the second dataset, I have the details of the author and author Id alone. I have a third dataset, which has citation details of every publication i.e. annual citations received by a given publication for a duration of 40 years. For the third dataset, I have only the title. My idea was to match the dataset 1 and datset2, thereby generating a dataset which will have title and author ID and then merge ot with the third dataset, as I need to have total cittaion received by the author per year.

I tried the way you suggested to reshape the data. However, the data is too large and it returns error.
Comment

Announcement

Merge for multiple variables

Comment

Comment

Comment

Comment

Comment

Comment