Classifying variable types on a list of variables

Manpreet Singh

Posted on 16 Aug 2022 under General Tech, Learning Aids/Tools.

Answers (1)

manpreet Best Answer 3 years ago

I have a list of around 700 variables that I need to clean up. What complicates things is that different numeric codes flag an invalid value, and these codes differ by variable type. I wanted to see whether some form of unsupervised learning could help with this task. I would appreciate any advice or suggestions.

Let me elaborate on what I'm working with.

When I say variable type: all of the variables are numeric, but I'm trying to classify them into categories such as dollar amount, age, or number of something, based on the name of the variable, because the rules for invalid flags differ by category.

Because of this, I'd like to classify my variables into things like:

Dollar Amount
Age
Number of items
etc...

Here's an example of what invalid values look like:

Invalid values for a variable of type "Number of items":

6,7,8,9

Invalid values for a variable of type "Dollar amounts":

99996, 99997, 99999
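To make the per-type rules concrete, here is a minimal sketch of how the codes above could be stored and applied once a variable's type is known. The type labels and the `mask_invalid` helper are hypothetical names chosen for illustration; only the code values themselves come from the examples above.

```python
# Invalid-value codes per variable type, copied from the examples above.
# The type labels ("number_of_items", "dollar_amount") are illustrative names.
INVALID_CODES = {
    "number_of_items": {6, 7, 8, 9},
    "dollar_amount": {99996, 99997, 99999},
}


def mask_invalid(values, var_type):
    """Return a copy of `values` with invalid codes replaced by None.

    Types without a known rule are returned unchanged.
    """
    codes = INVALID_CODES.get(var_type, set())
    return [None if v in codes else v for v in values]


print(mask_invalid([1, 3, 7, 9], "number_of_items"))
print(mask_invalid([1200, 99997, 45000], "dollar_amount"))
```

Keeping the codes in one lookup table means a new category only needs a new entry, not new cleanup logic.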

Some additional points:

Each variable has a maximum, such as 9 for number of items. However, the maximum isn't the most reliable thing to filter on, since the same rule could wrongly catch other types, such as dollar amounts.
The variable names can sometimes be telling of the type of variable

1) It could have a keyword in the variable name like "N_" to indicate that variable is number of items.

2) Sometimes the rule is not so simple and can be confounded by other keywords, e.g. N_ITEMS_PCT_50. This is actually a number of items with a percentage over 50%, rather than a percentage value.
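The two naming points above could be sketched as an ordered set of keyword rules, where the leading "N_" prefix is checked first so that a later "PCT" token does not override it. Only the "N_" prefix and the N_ITEMS_PCT_50 example come from the description above; the other keywords ("AGE", "AMT", "DOLLAR", "PCT"), the category labels, and the extra variable names are assumptions for illustration.

```python
def classify_by_name(name):
    """Guess a variable's category from the tokens in its name.

    Rules are checked in priority order. The keyword lists are
    assumptions, apart from the leading "N" prefix.
    """
    tokens = name.upper().split("_")
    # A leading "N" token marks a count, even if "PCT" appears later.
    if tokens[0] == "N":
        return "number_of_items"
    if "AGE" in tokens:
        return "age"
    if "AMT" in tokens or "DOLLAR" in tokens:
        return "dollar_amount"
    if "PCT" in tokens:
        return "percentage"
    return "unknown"


for var in ["N_ITEMS", "N_ITEMS_PCT_50", "PCT_RENTERS", "HH_INCOME_AMT"]:
    print(var, "->", classify_by_name(var))
```

Variables that come back as "unknown" would still need manual review or a similarity-based approach.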

Some of the features I collected to help with measuring similarity:

1) The variable names, of course

2) Maximum values of each variable

3) The number of times an invalid flag (for each type) appears in each variable. I would calculate this by counting how many observations fall in the range of invalid values. For "number of items", I would count the observations from 6 to 9. I would then calculate another column for invalid dollar amounts by counting the observations from 99996 to 99999.
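The three features above could be computed along these lines. This is a minimal sketch that assumes each variable is available as a plain list of numbers; the `build_features` helper and the column names are hypothetical, while the two ranges are the ones described in point 3.

```python
# Ranges of invalid values to count, as described in point 3 above.
# The column names are illustrative.
INVALID_RANGES = {
    "items_flag_count": (6, 9),
    "dollar_flag_count": (99996, 99999),
}


def build_features(name, values):
    """Build one feature row: the name, the maximum, and one count per range."""
    row = {"name": name, "max": max(values)}
    for column, (low, high) in INVALID_RANGES.items():
        # Count observations that fall inside the inclusive range.
        row[column] = sum(low <= v <= high for v in values)
    return row


print(build_features("N_ITEMS", [1, 2, 7, 9, 3]))
print(build_features("LOAN_AMT", [1200, 99997, 45000, 8]))
```

Note that a dollar variable can legitimately contain values from 6 to 9, so the counts are evidence rather than proof of a type.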

I'm interested to see whether this could be a viable approach, since I'd rather not make this a very manual process for 700 variables. I would appreciate any insight.
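As one possible starting point for the similarity idea, variable names could be grouped by the overlap of their tokens. The sketch below uses a simple greedy grouping with Jaccard similarity, which is a stand-in for a proper clustering algorithm rather than a recommendation; the 0.3 threshold and every variable name except N_ITEMS_PCT_50 are assumptions for illustration.

```python
def name_tokens(name):
    """Split a variable name into a set of upper-case tokens."""
    return set(name.upper().split("_"))


def jaccard(a, b):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    return len(a & b) / len(a | b)


def group_names(names, threshold=0.3):
    """Greedily group names: join the first group whose first member
    is similar enough, otherwise start a new group."""
    groups = []
    for name in names:
        for group in groups:
            if jaccard(name_tokens(name), name_tokens(group[0])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


names = ["N_ITEMS", "N_ITEMS_PCT_50", "AMT_LOAN", "AMT_LOAN_BAL", "AGE_HEAD"]
for group in group_names(names):
    print(group)
```

The resulting groups could then be labelled by hand once per group instead of once per variable, and the numeric features (maximum and flag counts) could be used to sanity-check each group.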

Thanks
