list non-missing values

Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#1

list non-missing values

18 May 2022, 08:16

Hi all,

I would like to identify all the non-missing values of a variable without repetition of the values of the variables.
How do I go about this?

Best,
Tharshini
Tags: None
daniel klein

Join Date: Mar 2014

Posts: 3912
#2

18 May 2022, 08:22

How about

Code:

tabulate varname
Comment
Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#3

18 May 2022, 09:52

well, that will give me all values of the variable and my dataset contains 12,215,798 observations. So what I would like to have is for a variable; Stata should identify all non-missing values. This is because I would like to label the values but since I have so many observations, I was thinking that there must be a code that identify such a problem as described.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36058
#4

18 May 2022, 10:06

Did you try Daniel klein's suggestion?

The number of observations you have is not the issue. tabulate will list distinct values with their frequencies. If tabulate doesn't show you a small or moderate number of distinct integer values, defining value labels will not be practicable any way.
Comment

daniel klein

Join Date: Mar 2014
Posts: 3912

18 May 2022, 10:36

Verbal explanations and verbal answers might not be the best way to proceed here.

Let's work with an example:

Code:

. sysuse auto
(1978 automobile data)

. describe , short

Contains data from C:\Program Files\Stata17\ado\base/a/auto.dta
 Observations:            74                  1978 automobile data
    Variables:            12                  13 Apr 2020 17:45
Sorted by: foreign

. tabulate rep78

     Repair |
record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00

As you can see, although the dataset has 74 observations, only 69 observations of the variable rep78 are non-missing. There are 5 distinct values.

Now let's blow up those 74 observations to bring it closer to what you are dealing with

Code:

. expand 162162
(11,999,914 observations created)

. tabulate rep78

     Repair |
record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |    324,324        2.90        2.90
          2 |  1,297,296       11.59       14.49
          3 |  4,864,860       43.48       57.97
          4 |  2,918,916       26.09       84.06
          5 |  1,783,782       15.94      100.00
------------+-----------------------------------
      Total | 11,189,178      100.00

As you can see, there are now 11,189,178 non-missing observations, there are still only 5 distinct values.

If you have a different problem, please create an example that illustrates what you have and what you want.

Comment

Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#6

19 May 2022, 02:21

I believe my problem is "somewhat" different or my explanation is not clear enough so I'll demonstrate with an example. The variable AstNr_LISA is an identifier for work sector, thus the values gives me the the work sector and I would like to label each values. However, as my variable contains too many unique values the following code returns too long results.

Code:

tabulate AstNr_LISA

.

Listed here below 10 out of 24092810 observations

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long AstNr_LISA 0 0 0 0 0 0 0 0 0 0 end

I found the following code

Code:

groups AstNr_LISA, order(h) select(15)

but not necessarily what I had in my mind or perhpas this is the code to go about but surely there is a code much more efficient ?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36058
#7

19 May 2022, 02:33

groups is from the Stata Journal. What is it about the program and what it does that makes you want something "much more efficient"? What does efficiency mean here: computer speed, memory demand, your time?

I am still unclear about what you want here, but in short, you seem to be telling us that you have many distinct values (I believe that) and that you want a very concise report (I believe that) but that you can't get both at once (I believe that, as it's likely that the goals aren't compatible).

groups shows, in your application, the 15 most common values. If they are labelled, that will show up. It can do other things.
Comment
Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#8

19 May 2022, 03:22

I guess you understand me perfectly, however, somehow I thought that it would be feasible to get a concise report but as you write it is not compatible.

That is correct, the groups shows the 15 most common values but does this means that I have to work through all unique values in this manner, that is, by using the code group and hit every 20 values and so on and label them ? This is where the work efficient kicks in, I was thinking that there might be clever coding that enable me to handle such data management?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36058
#9

19 May 2022, 03:45

If I understand you perfectly it is only that I sense that your goals seem contradictory.

With groups you can look at the highest # in terms of frequency. There is no handle to look at distinct values between #1 and #2 in order of frequency.

You seem to want to attach value labels to some values but not others. That is perfectly legal so long as the values to be labelled are integers. I don't get a sense of where the value labels are coming from, a string variable in Stata, a different Stata dataset, or something else entirely.
Comment
Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#10

20 May 2022, 08:34

So, minor corrections...I would like to attach labels on all values. The label values are what you see on the #6. The 0's are values which indicates a work sector and I have too many unique values which makes the work tedious as I cannot see all values on a row. This is why I asked you whether I can use groups to see every 20 values or so and label each and every one. But if I understood you correct, there is no solution to this problem?

The value label are found on a code book given to me.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3912
#11

20 May 2022, 09:53

If you have to manually (re-)type the labels for hundreds or thousands of values, listing those distinct values is the least of your problems in terms of effort. You will be spending hours no matter how concise you list those values. If you can have Stata read the value labels from a list, it's a different story. I am probably bailing out here as I just cannot infer the information I need to make useful suggestions.
Comment
Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#12

02 Jun 2022, 03:01

So, I'll give another try to explain. I have a dataset with variables that I need to label values, the "problem" that I am facing is that these variables contains many distinct unique values which becomes tedious job and also high probability to human error. As previously stated my problem ( see the tread above) I was aiming for working manually, i.e give each distinct value a label despite the likelihood of typing error. However, after investing other possibilities, I have code book that I have converted from excelfile to stata file. I would like to merge these two datasets so that I do not need to manually type in labels to the variable. So, here is what I am facing now, I would like to merge the dataset that you see here below. However, in order to do so, I need to make sure that I have the same key variable in both datasets. In the first dataset (contains labels to all variables), there is VariableNamn which contains variable AstSNI92 and KlarText contains the labels to AstSNI92. However, I need to extract AstSNI92 into a variable together with its labels KlarText. I would like the two following AstSNI92 and KlarText to be a variable so that AstSNI92 becomes the key variable between the two datasets and feasible to merge both datasets.

Perhaps there is another (more efficient) solution(s), in which case I would be happy to hear.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str16 Tabell int Ar str15 VariabelNamn str206 KlarText str7 Kod long Antal "JH_Lev_LISA_1992" 1992 "AstSNI92" "Ej sysselsatt/okänd näringsgren" "00000" 23859 "JH_Lev_LISA_1992" 1992 "AstSNI92" "Spannmålsodling m.m." "01111" 2 "JH_Lev_LISA_1992" 1992 "AstSNI92" "Övrig odling av jordbruksväxter" "01119" 2 "JH_Lev_LISA_1992" 1992 "AstSNI92" "Odling av köksväxter på friland" "01121" 6 "JH_Lev_LISA_1992" 1992 "AstSNI92" "Odling av plantskoleväxter m.m. på friland" "01122" 26 end

Code:

* Example generated by -dataex-. For more info, type help dataex clear input double Lopnr_FK str4 FamStF str6 AstSNI92 145174 "311" "01124" 58986 "13" "51570" 232749 "321" "00000" 65042 "23" "70201" 140296 "11" "70201" end
Comment

Announcement