Classifying variable types on a list of variables


Posted on 16 Aug 2022 in General Tech, Learning Aids/Tools.

manpreet, 2 years ago

I have a list of around 700 variables that I need to clean up. What complicates things is that different numeric codes flag an invalid value, and these codes differ by variable type. I'd like to know whether some form of unsupervised learning could help with this task. I'd appreciate any advice or suggestions.

Let me elaborate on what I'm working with.

When I say "variable type", I don't mean the data type: the variables are all numeric. I'm trying to classify them into categories such as dollar amount, age, or number of something, based on the variable name, because the rules for invalid flags differ by category.

Because of this, I'd like to classify my variables into things like:

  • Dollar Amount
  • Age
  • Number of items
  • etc...

Here's an example of what invalid values look like:

Invalid values for a variable of type "Number of items":

  • 6, 7, 8, 9

Invalid values for a variable of type "Dollar amounts":

  • 99996, 99997, 99999
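To make the type-dependent flags concrete, here is a minimal Python sketch that maps each type to its invalid codes and blanks them out. The names `INVALID_CODES` and `clean_values` are hypothetical, and the codes are only the examples listed above, not a complete rule set.

```python
# Hypothetical lookup: variable type -> codes that flag an invalid value.
# Only the example codes from this post are included.
INVALID_CODES = {
    "number_of_items": {6, 7, 8, 9},
    "dollar_amount": {99996, 99997, 99999},
}


def clean_values(values, var_type):
    """Replace the invalid codes for the given type with None.

    Types that are not in INVALID_CODES are returned unchanged.
    """
    invalid = INVALID_CODES.get(var_type, set())
    return [None if v in invalid else v for v in values]


if __name__ == "__main__":
    # 7 and 9 are flags for a count, but 99997 is not.
    print(clean_values([3, 7, 9, 99997], "number_of_items"))
```

The same value can be valid under one type and invalid under another, which is why the type has to be known before any cleanup rule is applied.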

Some additional points:

  • These variables have maximums, such as 9 for a number of items. But the maximum isn't the most reliable thing to filter by, since a rule based on it could easily catch other types, such as dollar amounts.

  • The variable names can sometimes indicate the type of variable:

    1) The name could contain a keyword like "N_" to indicate that the variable is a number of items.

    2) Sometimes the rule is not so simple and can be confounded by other keywords, e.g. N_ITEMS_PCT_50. This is actually a number of items with a percentage over 50%, rather than a percentage value.
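One way to handle the confounding keywords described above is to check rules in a fixed order, so that a leading "N" wins over a later "PCT". The sketch below is only an illustration: `classify_by_name` is a hypothetical function, and every keyword other than "N_" (such as "AGE", "AMT", and "PCT") is an assumption, not something taken from the actual variable list.

```python
def classify_by_name(name):
    """Guess a variable category from its name using ordered keyword rules.

    The first matching rule wins, so a name starting with "N_" is treated
    as a count even if it also contains "PCT".
    """
    tokens = name.upper().split("_")
    if tokens[0] == "N":
        return "number_of_items"
    if "AGE" in tokens:          # assumed keyword
        return "age"
    if "AMT" in tokens:          # assumed keyword
        return "dollar_amount"
    if "PCT" in tokens:          # assumed keyword
        return "percentage"
    return "unknown"


if __name__ == "__main__":
    for var in ["N_ITEMS_PCT_50", "PCT_OVERDUE", "CUST_AGE", "XYZ"]:
        print(var, "->", classify_by_name(var))
```

Names that fall through to "unknown" would still need manual review or a similarity-based method.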

Some of the features I collected to help with measuring similarity:

1) The variable names, of course

2) Maximum values of each variable

3) The number of times an invalid flag (for each type) appears in each variable. I would get this by counting how many observations fall in the range of invalid values. For "number of items", I would count the observations ranging from 6 to 9. I would then calculate another column for invalid dollar amounts by counting the observations from 99996 to 99999.
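The numeric features in 2) and 3) could be computed per variable roughly as follows. This is a sketch under stated assumptions: `flag_features` is a hypothetical helper, each variable is assumed to be a plain list of numbers, and the two ranges are just the examples given above.

```python
def flag_features(values):
    """Return simple per-variable features for one column of numbers.

    - max: the largest observed value (None for an empty column)
    - n_items_flags: observations in the 6..9 range
    - n_dollar_flags: observations in the 99996..99999 range
    """
    return {
        "max": max(values, default=None),
        "n_items_flags": sum(1 for v in values if 6 <= v <= 9),
        "n_dollar_flags": sum(1 for v in values if 99996 <= v <= 99999),
    }


if __name__ == "__main__":
    print(flag_features([1, 6, 9, 500, 99997]))
```

Note that the 6 to 9 count also picks up legitimate small values in a dollar-amount variable, so these counts are evidence for a type rather than proof of it.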

I'm interested to see whether this could be a viable approach, as I'd rather have some of the work done for me than make this a very manual process for 700 variables. I'd appreciate any insight.

Thanks
