Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Conditional Split and Combine

    I have a dataset consisting of Author Names and Author ID from Scopus. The problem I am facing is that, each column had multiple entry. I am copying a section of my data below for clarification.
    Authors Author(s) ID
    Abbott N., Krafft M.P. 57210774564;7006041188;
    Abboud F., Damiano J.P., Papiernik A. 7102796864;16050119900;7003385210;
    Abboud F., Damiano J.P., Papiernik A. 7102796864;16050119900;7003385210;
    Abboud F., Damiano J.P., Papiernik A. 7102796864;16050119900;7003385210;
    Abboud F., Damiano J.P., Papiernik A. 7102796864;16050119900;7003385210;
    In the current dataset the entries are as follows. (A1, A2, A3, .....) (B1, B2, B3, .....). I want to split this data such that I have Individual Author and their Ids and separated i.e. (A1, B1), A2, B2) ......

    P.S: I am relatively new to Stata and have never tried anything similar to this.

  • #2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str37 Authors str34 authorsid
    "Abbott N., Krafft M.P."                "57210774564;7006041188;"          
    "Abboud F., Damiano J.P., Papiernik A." "7102796864;16050119900;7003385210;"
    "Abboud F., Damiano J.P., Papiernik A." "7102796864;16050119900;7003385210;"
    "Abboud F., Damiano J.P., Papiernik A." "7102796864;16050119900;7003385210;"
    "Abboud F., Damiano J.P., Papiernik A." "7102796864;16050119900;7003385210;"
    end
    
    contract Authors authorsid, nomiss freq(count)
    replace count= length(authorsid)- length(subinstr(authorsid, ";","", .))
    expand count
    bys Authors: gen name = word(subinstr(subinstr(trim(itrim(Authors)), " ", "*", .), ",", " ", .), _n)
    replace name= trim(itrim(subinstr(name, "*", " ", .)))
    by Authors: gen id = word(subinstr(trim(itrim(authorsid)), ";", " ", .), _n)
    keep name id
    Res.:

    Code:
    . l
    
         +----------------------------+
         |         name            id |
         |----------------------------|
      1. |    Abbott N.   57210774564 |
      2. |  Krafft M.P.    7006041188 |
      3. |    Abboud F.    7102796864 |
      4. | Damiano J.P.   16050119900 |
      5. | Papiernik A.    7003385210 |
         +----------------------------+
    Last edited by Andrew Musau; 16 Oct 2023, 04:44.

    Comment


    • #3
      Thanks Andrew Musau . I tried to run the code. However, I have a large dataset and when I run the code, I still get teh results for only 5 columns as dispalyed in your response. How do I extend my analysis to the complete file?

      Comment


      • #4
        I do not follow your response. Use dataex to provide a larger example that illustrates the difficulties in extending #2.

        Comment


        • #5
          Andrew Musau I figured the first part, However, it shows error (pasted below) while running the code.

          contract Authors authorsid, nomiss freq(count)
          'contract' cannot be read as a number
          1. replace count= length(authorsid)- length(subinstr(authorsid, ";","", .))
          'replace' cannot be read as a number
          1. expand count
          'expand' cannot be read as a number
          1. bys Authors: gen name = word(subinstr(subinstr(trim(itrim(Authors)), " ", "*", .), ",", " ", .)
          > , _n)
          'bys' cannot be read as a number
          1. replace name= trim(itrim(subinstr(name, "*", " ", .)))
          'replace' cannot be read as a number
          1. by Authors: gen id = word(subinstr(trim(itrim(authorsid)), ";", " ", .), _n)
          'by' cannot be read as a number
          1. keep name id
          'keep' cannot be read as a number
          1.
          unexpected end of file
          r(612);

          Comment


          • #6
            #2 assumes that the data is loaded into Stata and then you just run

            contract Authors authorsid, nomiss freq(count)
            replace count= length(authorsid)- length(subinstr(authorsid, ";","", .))
            expand count
            bys Authors: gen name = word(subinstr(subinstr(trim(itrim(Authors)), " ", "*", .), ",", " ", .), _n)
            replace name= trim(itrim(subinstr(name, "*", " ", .)))
            by Authors: gen id = word(subinstr(trim(itrim(authorsid)), ";", " ", .), _n)
            keep name id
            There should be no inputting of data as implied by #5. Get a colleague to help you load the data and run the code in #2 if you have issues. Alternatively, load the dataset and copy and paste the output of

            Code:
            dataex in 1/20
            if you cannot obtain results with #2.

            Comment


            • #7
              Thank you Andrew

              Comment

              Working...
              X