Merging two data files

christine ellison

Join Date: Nov 2018

Posts: 8
#1

Merging two data files

23 Nov 2018, 10:06

Newbie here.

I am trying to merge two files - one is a headteacher survey from 200 schools. The second is a parent survey carried out with 6 parents from 60 schools (so 360 observations in total) randomly selected from the 200. Both have the variable SchoolName as the identifier.

I have tried the following:

Using the parent file as the master one and the command "merge 1:m SchoolName using" the parent file. I've also tried the command merge m:1. It keeps returning the message "that variable SchoolName does not uniquely identify observations".

I have searched for duplicates and found none.

I've checked SchoolName is stored as a number in both files.

Thank you for your help.
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

23 Nov 2018, 11:11

Welcome to Statalist.

I'm guessing your search for duplicates in your headteacher survey was not done correctly. If you didn't specify a variable list, the duplicates command will look for observations for which every variable is identical. The commands you should start with are

Code:

duplicates report SchoolName duplicates examples SchoolName

run in the dataset with your headteacher survey. (With six parents from each school, we know SchoolName will not uniquely identify observations in the parent survey).

Wit that said, there is much you could do to improve the likelihood of a useful answer. Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

Section 12.1 is particularly pertinent

12.1 What to say about your commands and your problem

Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!

It's particularly helpful to copy commands and output from your Stata Results window and paste them into your Statalist post using code delimiters [CODE] and [/CODE], and to use the dataex command to provide sample data, as described in section 12 of the FAQ.

The more you help others understand your problem, the more likely others are to be able to help you solve your problem.
Comment
christine ellison

Join Date: Nov 2018

Posts: 8
#3

26 Nov 2018, 07:26

Thank you very much for your help. This is my first time using Stata and I am trying to teach myself so I really appreciate your help.

I ran the commands you suggested and received the following

. duplicates report SchoolName

Duplicates in terms of SchoolName

--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 198 0
2 | 2 1
--------------------------------------

. duplicates examples SchoolName

Duplicates in terms of SchoolName

+-------------------------+
| # e.g. obs School~e |
|-------------------------|
| 2 199 201 |
+-------------------------+

I then tried

. duplicates drop

Duplicates in terms of all variables

(0 observations are duplicates)

. duplicates SchoolName drop
illegal duplicates subcommand
r(198);

. list SchoolName

+----------+
| School~e |
|----------|
1. | 1 |
2. | 2 |
3. | 3 |
4. | 4 |
5. | 5 |
|----------|
6. | 6 |
7. | 7 |
8. | 8 |
9. | 9 |
10. | 10 |
|----------|
11. | 11 |
12. | 12 |
13. | 13 |
14. | 14 |
15. | 15 |
|----------|
16. | 16 |
17. | 17 |
18. | 18 |
19. | 19 |
20. | 20 |
|----------|
21. | 21 |
22. | 22 |
23. | 23 |
24. | 24 |
25. | 25 |
|----------|
26. | 26 |
27. | 27 |
28. | 28 |
29. | 29 |
30. | 30 |
|----------|
31. | 31 |
32. | 32 |
33. | 33 |
34. | 34 |
35. | 35 |
|----------|
36. | 36 |
37. | 37 |
38. | 38 |
39. | 39 |
40. | 40 |
|----------|
41. | 41 |
42. | 42 |
43. | 43 |
44. | 44 |
45. | 45 |
|----------|
46. | 46 |
47. | 47 |
48. | 48 |
49. | 49 |
50. | 50 |
|----------|
51. | 51 |
52. | 52 |
53. | 53 |
54. | 54 |
55. | 55 |
|----------|
56. | 56 |
57. | 57 |
58. | 58 |
59. | 59 |
60. | 60 |
|----------|
61. | 61 |
62. | 62 |
63. | 63 |
64. | 64 |
65. | 65 |
|----------|
66. | 66 |
67. | 67 |
68. | 68 |
69. | 69 |
70. | 70 |
|----------|
71. | 71 |
72. | 72 |
73. | 73 |
74. | 74 |
75. | 75 |
|----------|
76. | 76 |
77. | 77 |
78. | 78 |
79. | 79 |
80. | 80 |
|----------|
81. | 81 |
82. | 82 |
83. | 83 |
84. | 84 |
85. | 85 |
|----------|
86. | 86 |
87. | 87 |
88. | 88 |
89. | 89 |
90. | 90 |
|----------|
91. | 91 |
92. | 92 |
93. | 93 |
94. | 94 |
95. | 95 |
|----------|
96. | 96 |
97. | 97 |
98. | 98 |
99. | 99 |
100. | 100 |
|----------|
101. | 101 |
102. | 102 |
103. | 103 |
104. | 104 |
105. | 105 |
|----------|
106. | 106 |
107. | 107 |
108. | 108 |
109. | 109 |
110. | 110 |
|----------|
111. | 111 |
112. | 112 |
113. | 113 |
114. | 114 |
115. | 115 |
|----------|
116. | 116 |
117. | 117 |
118. | 118 |
119. | 119 |
120. | 120 |
|----------|
121. | 121 |
122. | 122 |
123. | 123 |
124. | 124 |
125. | 125 |
|----------|
126. | 126 |
127. | 127 |
128. | 128 |
129. | 129 |
130. | 130 |
|----------|
131. | 131 |
132. | 132 |
133. | 133 |
134. | 134 |
135. | 135 |
|----------|
136. | 136 |
137. | 137 |
138. | 138 |
139. | 140 |
140. | 141 |
|----------|
141. | 142 |
142. | 143 |
143. | 144 |
144. | 145 |
145. | 146 |
|----------|
146. | 147 |
147. | 148 |
148. | 149 |
149. | 150 |
150. | 151 |
|----------|
151. | 152 |
152. | 153 |
153. | 154 |
154. | 155 |
155. | 156 |
|----------|
156. | 157 |
157. | 158 |
158. | 159 |
159. | 160 |
160. | 161 |
|----------|
161. | 162 |
162. | 163 |
163. | 164 |
164. | 165 |
165. | 166 |
|----------|
166. | 167 |
167. | 168 |
168. | 169 |
169. | 170 |
170. | 171 |
|----------|
171. | 172 |
172. | 173 |
173. | 174 |
174. | 175 |
175. | 176 |
|----------|
176. | 177 |
177. | 179 |
178. | 180 |
179. | 181 |
180. | 182 |
|----------|
181. | 183 |
182. | 184 |
183. | 185 |
184. | 186 |
185. | 187 |
|----------|
186. | 188 |
187. | 189 |
188. | 190 |
189. | 191 |
190. | 192 |
|----------|
191. | 193 |
192. | 194 |
193. | 195 |
194. | 196 |
195. | 197 |
|----------|
196. | 198 |
197. | 199 |
198. | 200 |
199. | 201 |
200. | 201 |
+----------+

Does that mean the issue is with schoolName 138? I thought it might, so I tried this,

. duplicates drop if SchoolName=38
=exp not allowed
r(101);
Comment
christine ellison

Join Date: Nov 2018

Posts: 8
#4

26 Nov 2018, 07:31

sorry, that last bit should say

duplicates drop if SchoolName=138
=exp not allowed
r(101);
Comment
christine ellison

Join Date: Nov 2018

Posts: 8
#5

26 Nov 2018, 07:47

I have also tried

. duplicates drop if 'SchoolName="138"
'SchoolName invalid name
r(198);

. duplicates drop if SchoolName="138"
=exp not allowed
r(101);

. duplicates drop if 'SchoolName'=138
'SchoolName' invalid name
r(198);

. 'duplicates drop, force'
' is not a valid command name
r(199);

. duplicates drop, force
options not allowed
r(101);
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

26 Nov 2018, 08:06

This is my first time using Stata and I am trying to teach myself

I'm sympathetic to you as a new user of Stata - it's a lot to absorb. And even worse if perhaps you are under pressure to produce some output quickly. Nevertheless, I'd like to encourage you to take a step back from your immediate tasks.

When I began using Stata in a serious way, I started, as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. There are a lot of examples to copy and paste into Stata's do-file editor to run yourself, and better yet, to experiment with changing the options to see how the results change.

All of these manuals are included as PDFs in the Stata installation (since version 11) and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu. The objective in doing the reading was not so much to master Stata as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and PDF manuals.

Stata supplies exceptionally good documentation that amply repays the time spent studying it - there's just a lot of it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively.

Onward with your problem.

Let's start with your final point

Code:

duplicates drop if SchoolName=138 =exp not allowed r(101);

In Stata, to compare two values, you would write

Code:

duplicates drop if SchoolName==138

This is truly basic Stata syntax, and you will learn it best by following the reading plan described above. But this was not the command you should have run in any event, so don't rush back to the program to run the corrected version. Read on.

You ran the two commands I suggested, but you apparently did not understand the output that was reported, and did not read further in the output of help duplicates or in the PDF documentation to try to understand what was reported and how the duplicates command is to be used.

Here is your work, presented readbly using CODE delimiters as recommended in post #2.

Code:

. duplicates report SchoolName Duplicates in terms of SchoolName -------------------------------------- copies | observations surplus ----------+--------------------------- 1 | 198 0 2 | 2 1 -------------------------------------- . duplicates examples SchoolName Duplicates in terms of SchoolName +-------------------------+ | # e.g. obs School~e | |-------------------------| | 2 199 201 | +-------------------------+

The first command tells us that you have 198 distinct values of SchoolName, and one SchoolName which appears twice.

The second command tells us observation 199 is a sample of an observation with a duplicated value of SchoolName, and that the duplicated value is 201.

Nothing indicates that SchoolName 138 was the problem. The output you listed is sorted by SchoolName, which makes it clear that SchoolName 138 appears only one time. Observations 199 and 200 contain the same value of SchoolName - 201.

Your next step should be something like

Code:

list if SchoolName==201

You will find two observations that will differ in some of the variables. Your problem is to decide what to do about those observations. Is one obviously incorrect, in which case you can drop it. If not, you need to decide how to choose between the two observations, or whether to combine them somehow. This is something you will have to do based on your understanding of your data, not something that someone else unfamiliar with your data can tell you what to do.

Added in edit: This post crossed with your post #5. You cannot just guess Stata syntax, any more than you can guess how a foreign language is spoken. Study is crucial. Read the manuals.

Last edited by William Lisowski; 26 Nov 2018, 08:08.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2417
#7

26 Nov 2018, 08:59

Like William, I'm sympathetic to you as a new user, particularly because you are obviously *trying* to help yourself by reading the -help- and other documentation. (Not everyone who is having problems seems to try reading, so good for you---really.)

A couple of more comments:
1) It looks like your SchoolName is a string variable, not a numeric variable. Any reference to its values must therefore enclose them in quotes, e.g.

Code:

list if SchoolName == "201"

Unlike spreadsheets, most data analysis programs make a strong distinction between string variables and numeric variables, and expect you to indicate strings with quotes around their values.

2) Your use of -duplicates examples- was a good idea, and tells what I think the problem is:

Code:

duplicates examples SchoolName Duplicates in terms of SchoolName +-------------------------+ | # e.g. obs School~e | |-------------------------| | 2 199 201 | +-------------------------+

Somewhat obscurely, this was showing that two observations have the same value for SchoolName, namely BLANK. If the duplicating SchoolName had been something else, that command would have given something like the following, showing two observations with a value of "Lincoln" for SchoolName:

Code:

duplicates examples SchoolName Duplicates in terms of SchoolName +-------------------------+ | # e.g. obs School~e | |-------------------------| | 2 199 201 Lincoln | +-------------------------+

3) As an alternative to requesting examples of duplicates, I personally prefer to use the -tag- option of -duplicates- and then browse the data to look at them in the data editor. Give the following a try and I think it will show your problem more clearly by letting you look at all of your observations that are duplicates:

Code:

duplicates tag SchoolName, generate(dupes) browse if dupes

I suspect this will show you that you have two observations that are completely blank on all the variables, and which case you can indeed fix that in your data file.

Last edited by Mike Lacy; 26 Nov 2018, 09:04.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

26 Nov 2018, 09:31

I don't like to contradict other responses in public, but Mike Lacy has been confused by the presentation in post #3 of the output of the duplicates command without using CODE delimiters. In post #7 Mike tried reconstructing the original output from post #2 that was pasted without CODE delimiters, as did I, and he comes up with

Code:

duplicates examples SchoolName Duplicates in terms of SchoolName +-------------------------+ | # e.g. obs School~e | |-------------------------| | 2 199 201 | +-------------------------+

But "e.g. obs" is a single column heading, not two column headings, as the following example—enclosed in CODE delimiters from the start—demonstrates.

Code:

. sysuse auto, clear (1978 Automobile Data) . duplicates examples foreign Duplicates in terms of foreign +-----------------------------------+ | group: # e.g. obs foreign | |-----------------------------------| | 1 52 1 Domestic | | 2 22 53 Foreign | +-----------------------------------+

So the discussion of string variables and the like is beside the point. I stand on my recommendations in post #6. I did not suggest duplicates tag, although I have done so elsewhere, because in this case, where we can readily see the single duplicated value, I did not want to introduce another duplicates command to be misunderstood as my first two in post #2 were.
1 like
Comment
christine ellison

Join Date: Nov 2018

Posts: 8
#9

26 Nov 2018, 10:02

Thank you very much again for your help. I can now see that observation 199 and 200 are both labelled 201. It looks like a mistake was made in labeling SchoolName where number 138 and 178 were skipped out. I'm assuming I need to replace along the lines of:

replace SchoolName=178 if SchoolName==201

I can see that I need something else as that would change both the observations currently named 201 to 178. I can't figure out how?

Also, the reason I am trying to sort out the duplicates is because I want to merge the file with another one based on the SchoolName variable. If I change the variables on this file, do I have to do the same to the other one?
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

26 Nov 2018, 10:23

The missing school codes are 139 and 178.

How do you know which 201 is really supposed to be 139 and which is supposed to be 178?

Based on what you wrote in post #1 it seems likely that you also have duplicates in your Parent Survey dataset. In that case, you expect multiple parents, and thus multiple observations, for each school, so you need to find duplicated school/parent values and reassign them appropriately. That would start with something like

Code:

duplicates report SchoolName ParentID duplicates examples SchoolName ParentID duplicates tag SchoolName ParentID, generate(dupes) browse if dupes

If again the duplicated school codes are 139 and 178, the same question pertains: how do you know which observations are really supposed to be 139 and which are supposed to be 178?
Comment
christine ellison

Join Date: Nov 2018

Posts: 8
#11

26 Nov 2018, 10:30

Thank you again. I have gone back again to the organisation that collected the data to try and get some more information from them.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2417
#12

26 Nov 2018, 10:50

William-- Absolutely no need to be reluctant to correct me in public, and Christine, I'm sorry for any confusion I introduced. I've encountered blank or missing values as a confusing source of duplicates before, in others' and my own work, so I fell into a confirmation bias there.
Comment

Announcement

Merging two data files

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment