Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Finding and dropping strange patterns in data

    Hi everyone,

    I am trying to figure out a solution to finding 'strange' patterns in my data. I have a dataset of cellphones and I am trying to drop those observations where the phone numbers seem odd. For instance, I would like to be able to identify and flag observations within a variable called 'cellphone' that look like this:

    1. 5555252525
    2. 5555555555
    3. 5515151515
    4. 0123456789

    In these four cases, the numbers are either too repetitive or follow a series and thus it is highly unlikely that these are real phone numbers. All phone numbers in my dataset are 10 digits and a regular phone number could look something like this: 5559912630.

    Is there an easy way to do this? Is this even possible?

    Thank you so much for your help on this!

  • #2
    Please use dataex to present data examples in the future. There is no scientific solution to this problem and any algorithm will make several assumptions. I am no expert on any such algorithms or on regular expressions, but here is the little I can contribute. To identify repetitions of the same number, you can use


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input strL number
    "5555252525"
    "5555555555"
    "5515151515"
    "0123456789"
    "4412555531"
    "1111111111"
    "2222233322"
    "1010102020"
    end
    
    gen fake= ustrregexm(number,"([0-9])\1{4}")
    Here, the length of repetition would be 5 digits (i.e., 11111, 22222,..., 99999), the number inside the braces +1 since backslash (\1) counts repetitions backwards. You can change this to a different number when specifyng a repetition of different legth. You can do the same for repeated two digits, here of at least length 4, e.g., "5515151515", but be careful not to undo the step before when replacing

    Code:
    replace fake = ustrregexm(number,"([0-9][0-9])\1{3}") if fake!=1

    Code:
    . gen fake= ustrregexm(number,"([0-9])\1{4}")
    
    . replace fake = ustrregexm(number,"([0-9][0-9])\1{3}") if fake!=1
    (1 real change made)
    
    . l, clean
    
               number   fake  
      1.   5555252525      0  
      2.   5555555555      1  
      3.   5515151515      1  
      4.   0123456789      0  
      5.   4412555531      0  
      6.   1111111111      1  
      7.   2222233322      1  
      8.   1010102020      0
    Ultimately, I think you need to define what constitutes "real" and then move forward from there. I benefit a lot from the posts of Robert Picard, so maybe he can give you some more tips if he views this thread.

    Comment


    • #3
      I'm not a fan of rejecting data simply because they appear highly unlikely. The current winning combination of Michigan's Classic Lotto 47 is 4 10 19 29 34 39. Was 2 4 6 8 10 12 less likely to win?

      Since these are phone numbers, you should start with what is known. Look-up valid area codes. In the US, the rules do not permit area codes that start with 0 or 1. The second digit cannot be a 9 (reserved for future use). Further, area codes that end with the same two digits (e.g. 255 377) are also reserved.

      So from the example numbers in #1, only 5515151515 would pass the area code test. If you jump to the North American Numbering Plan Wikipedia page, you will see that the next three digits identify the central office (exchange code). The first digit must range from [2-9] while the second and third can include any digit except that they can't both be 1 (e.g. 311 611). Also, the 555 exchange is reserved but only from 555-0100 to 555-0199. Finally, some (many) exchange codes are not assigned within an area code.

      The next step would then be to get a list of utilized exchange codes. You can get these for each state from this NANPA site. Here's a few observations for the 551 area code around the 515 exchange:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str3 state int(npa nxx) str3 use str5 ocn str49 companyname str11 ratecenter str2 initialgrowth str11(assigneddate effectivedate) str2(pooledcode inservice)
      " NJ" 551 427 " AS" " 6391" " CELLCO PARTNERSHIP DBA VERIZON WIRELESS - NJ" " RAMSEY"     " G" " 12/19/2001" " "           " N" " "
      " NJ" 551 444 " AS" " 6391" " CELLCO PARTNERSHIP DBA VERIZON WIRELESS - NJ" " FAIR LAWN"  " "  " "           " "           " Y" " "
      " NJ" 551 449 " AS" " 050J" " MEDIA THREE CORPORATION - NJ"                 " HACKENSACK" " I" " 03/26/2018" " "           " Y" " N"
      " NJ" 551 482 " AS" " 4036" " NEW CINGULAR WIRELESS PCS, LLC - DC"          " UNION CITY" " I" " 05/06/2002" " "           " Y" " "
      " NJ" 551 486 " AS" " 6391" " CELLCO PARTNERSHIP DBA VERIZON WIRELESS - NJ" " HACKENSACK" " G" " 12/19/2001" " "           " N" " "
      " NJ" 551 497 " AS" " 6391" " CELLCO PARTNERSHIP DBA VERIZON WIRELESS - NJ" " HACKENSACK" " G" " 06/13/2005" " "           " Y" " "
      " NJ" 551 500 " AS" " 991G" " VOXBEAM TELECOMMUNICATIONS INC. - NJ"         " RAMSEY"     " I" " 04/03/2014" " "           " Y" " "
      " NJ" 551 502 " AS" " 4036" " NEW CINGULAR WIRELESS PCS, LLC - DC"          " RIDGEWOOD"  " I" " 06/09/2017" " "           " Y" " "
      " NJ" 551 511 " UA" " "     " "                                             " "           " "  " "           " "           " N" " N"
      " NJ" 551 516 " UA" " "     " "                                             " "           " "  " "           " "           " N" " N"
      " NJ" 551 551 " UA" " "     " "                                             " "           " "  " "           " "           " N" " N"
      " NJ" 551 555 " UA" " "     " "                                             " "           " "  " "           " "           " N" " N"
      " NJ" 551 556 " AS" " 6623" " OMNIPOINT COMMUNICATIONS, INC. - NJ"          " UNION CITY" " I" " 11/13/2002" " "           " Y" " "
      " NJ" 551 557 " AS" " 999H" " BLUEBIRD COMMUNICATIONS, LLC - NJ"            " PARK RIDGE" " I" " 09/11/2018" " 11/02/2018" " Y" " N"
      " NJ" 551 574 " AS" " 4036" " NEW CINGULAR WIRELESS PCS, LLC - DC"          " CLIFFSIDE"  " I" " 04/15/2002" " "           " N" " "
      end
      Based on the above, 551-515-1515 is not a valid number.

      So if you take the time to build the database for the 50 states, you can reject phone numbers based on the first 6 digits using a simple merge. There are probably other sources for the same data. Searching a bit deeper, you can download the utilized code data from this page.
      Last edited by Robert Picard; 26 Sep 2018, 10:07.

      Comment

      Working...
      X