Merging two datasets and making a scatter plot with grouped/paired replies

Mathias WK

Join Date: Apr 2023

Posts: 3
#1

Merging two datasets and making a scatter plot with grouped/paired replies

29 Apr 2023, 13:45

Hello everyone!

I am currently working on a project where I am supposed to merge two datasets and make a scatter plot.

One dataset has a leader index containing self evaluated data on performance. Another dataset contains an employee index with data on employees' perceived performance of the leader. This is a big organization with about 30 different subgroups where each subgroup contains one leader and several employees in each.

How do I group every specific leader with the specific employees from his/her department?

And how do I make a scatter plot where I combine the evaluations from the leaders on the y axis and the evaluations from the employees on the x axis?

I do not see merge as the only command I need to use. I also considered joinby but I do not see it as the relevant command since I do not want to combine but group/pair.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#2

29 Apr 2023, 15:26

Well, if each department has only a single leader, then the first data set could contain just one observation per department. So it seems like you would do something like:

Code:

use leader_index_data_set, clear merge 1:m department using employee_index_data_set

But if the data structure is more complicated that may not work. Writing code for imaginary data sets isn't really a good use of anybody's time. If the above code will not solve your problem, then please post back showing example data from both data sets. Choose the examples so that there will be some matches across them.

Be sure to use the -dataex- command to show the example dataex. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment

Mathias WK

Join Date: Apr 2023
Posts: 3

29 Apr 2023, 16:51

Thank you for your quick help, Clyde.
I do not know how to use the dataex in this case...

My situation is the following:

I have two datasets.

This is a row from the leader-dataset "leaderset":

ID	var1	var2	var3	performance1	performance2	performance3	var7	var8	var9	department	var10	var11
1	44	21	96	3	5	5	1	3	5	1	2	3
2	35	21	74	4	2	3	3	1	2	7	2	3

This is a row from the employee-dataset "employeeset":

ID	var1	var2	var3	var4	var5	var6	performance1(perceived)	performance2(perceived)	performance3(perceived)	var10	var11	var12	var13	var14	var15	var16	var17	var18	var19	department
1	47	17	3	5	3	3	2	3	4	3	2	2	1	3	1	1	3	3	3	1
2	43	42	3	5	2	3	4	3	5	2	1	3	4	4	4	1	1	2	3	7
3	44	42	3	2	4	3	2	4	2	2	3	4	5	3	4	3	3	3	1	1

Performance1, performance2, performance3 are used to create an index for the leader's own perception of his/her performance

* recoding Likert-scale so high values illustrate high score.
recode performance1 (1=5) (2=4) (3=3) (4=2) (5=1), generate(performance1a)
recode performance2 (1=5) (2=4) (3=3) (4=2) (5=1), generate(performance2a)
recode performance3 (1=5) (2=4) (3=3) (4=2) (5=1), generate(performance3a)

*Generating index for leader's own perception of performance
gen indexleader=(performance1a + performance2a + performance3a)

performance1(perceived), performance2(perceived), performance3(perceived) are used to create an index for the employees' perception of the leaders performance

* recoding Likert-scale so high values illustrate high score.
recode performance1(perceived) (1=5) (2=4) (3=3) (4=2) (5=1), generate(performanceperceived1)
recode performance2(perceived) (1=5) (2=4) (3=3) (4=2) (5=1), generate(performanceperceived2)
recode performance3(perceived) (1=5) (2=4) (3=3) (4=2) (5=1), generate(performanceperceived3)

*Generating index for employees' perception of leader's performance
gen indexemployee=(performanceperceived1 + performanceperceived2 + performanceperceived3)

What I need to know now is how I combine the two datasets so the leaders and the employees can be coupled. I need the index of the leader from department 1 to be connected to the index of the two employees from department 1 in the merged dataset. But they should not be combined. My wish is to show the index from the leaders on the y axis and the index from the employees on the x axis along with a trend line. I also want the index of the leader from department 7 to be connected to the index of the employee from department 7.

Should I then use the command

use leaderset, clear
merge 1:m department using employeeset

What comes next? I do not know if I should construct the indexes separately in the datafiles "leaderset" and "employeeset" or if I should wait until they are merged together.

Thank you!

Last edited by Mathias WK; 29 Apr 2023, 16:57.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#4

29 Apr 2023, 19:40

Should I then use the command

use leaderset, clear
merge 1:m department using employeeset

Yes, I believe that will work. It depends on whether, as in your example, department uniquely identifies observations in the leaderset. If that is not the case, then you would have to do -joinby department using employeeset- instead.

I do not know if I should construct the indexes separately in the datafiles "leaderset" and "employeeset" or if I should wait until they are merged together.

It would be slightly more computationally efficient to construct the indices separately and then combine the data sets. But unless your data set is gigantic, the difference in speed will not be noticeable.

I do not know how to use the dataex in this case...

I'm sorry, but I cannot believe that a person who can write the code you show for scoring the Likert scales can't figure out how to use -dataex-. Just run -help dataex- and follow the instructions.
Comment
Mathias WK

Join Date: Apr 2023

Posts: 3
#5

30 Apr 2023, 02:16

Thank you for your help! It is very appreciated.

Originally posted by Clyde Schechter View Post

I'm sorry, but I cannot believe that a person who can write the code you show for scoring the Likert scales can't figure out how to use -dataex-. Just run -help dataex- and follow the instructions.

The reason for not using the dataex is simply that my dataset is made with confidential data due to the respondents wishes which means I cannot post the original codes and data.

Originally posted by Clyde Schechter View Post

Yes, I believe that will work. It depends on whether, as in your example, department uniquely identifies observations in the leaderset. If that is not the case, then you would have to do -joinby department using employeeset- instead.

I have now tried to run the command but it says that the department variable does not exist. I do not understand this since both employeeset and leaderset contain the "department" variable. But I have constructed the indices in each dataset now.

I cannot merge the two for some reason. I have tried both the "joinby" and the "merge" command but none of them work. What I understand by the command merge 1:m is that it will delete all the following respones which means that ID3 from my example would be deleted if the merge was completed because of the "one-to-many" logic. Is that correct?

My wish is to combine alle the answer scores of the employees from one department on the x axis and and the one answer score from the leader on the y axis.

I hope this is understandable.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#6

30 Apr 2023, 08:35

have now tried to run the command but it says that the department variable does not exist. I do not understand this since both employeeset and leaderset contain the "department" variable. But I have constructed the indices in each dataset now.

Without example data, and exact code and output for context, it is hard to give specific help here. One possibility is that the variables you are calling department in the two data sets are not exactly the same. Stata is case sensitive. So if one of them is department and the other is Department, they will not match. So check for that, or perhaps one of them is misspelled. The two names must be exactly the same in all respects to match.

What I understand by the command merge 1:m is that it will delete all the following respones which means that ID3 from my example would be deleted if the merge was completed because of the "one-to-many" logic. Is that correct?

No, that's not correct. -merge- does not delete anything unless you specifically tell it to with the -keep()- option (in which case it looses what you don't choose to keep). The same is true of -joinby-, except the name of the option is -unmatched()- instead of -keep()-. So, unless you use those options to explicitly direct Stata to remove certain observations from the result, you can always count on the result of -merge- or -joinby- to contain all of the data from both of the data sets being combined.

The reason for not using the dataex is simply that my dataset is made with confidential data due to the respondents wishes which means I cannot post the original codes and data.

This comes up fairly often; many people work with confidential data. But there's a simple workaround. Example data needs to have the same structure and metadata as the real thing in order to be helpful; but the actual data values are, in most situations, not important. So you can just replace the actual data values with random numbers. The only place where you would have to be more careful than that in your situation is with the department variable--there the actual values matter. But having a fake data set where the department variable itself is real wouldn't provide anybody with usable information.

Last edited by Clyde Schechter; 30 Apr 2023, 08:38.
Comment

Announcement