How to combine text vector parameter with other parameters before feeding it to sklearn?

Manpreet Singh

Posted on 16 Aug 2022, this text provides information on Bugs & Fixes related to General Tech. Please note that while accuracy is prioritized, the data presented might not be entirely correct or up-to-date. This information is offered for general knowledge and informational purposes only, and should not be considered as a substitute for professional advice.

Answers (2)

manpreet Best Answer 2 years ago

I'm trying to combine two types of parameters before clustering.

My parameters are text, represented as a sparse matrix, and a second array holding the other features of each data point.

I've tried combining the two types of parameters into one array and passing it as input to the algorithm:

db = DBSCAN(eps=1, min_samples=3, metric=get_distance).fit(array(combined_list))

Also I've built a custom distance metric which I'm going to use.

def get_distance(vec1,vec2):
    text_distance = cosine_similarity(vec1[0] ,vec2[0])
    other_distance = vec1[1]-vec2[1]

    return (text_distance+other_distance)/2

But I'm getting an error when I pass in my input array. The combined array was constructed as follows:

combined_list = []
for i in range(len(hashes_list)):
    combined_list.append((hashes_list[i],text_list[i]))

combined_list = array(combined_list)

Full Error Traceback:

db = DBSCAN(eps=1, min_samples=3, metric=get_distance ).fit(array(combined_list))

Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "", line 1, in <module>
  File "/Users/tal/src/campaign_detection/Data_Extractor/venv/lib/python3.7/site-packages/sklearn/cluster/dbscan_.py", line 319, in fit
    X = check_array(X, accept_sparse='csr')
  File "/Users/tal/src/campaign_detection/Data_Extractor/venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 527, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "/Users/tal/src/campaign_detection/Data_Extractor/venv/lib/python3.7/site-packages/numpy/core/numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

Is this the correct approach for combining text vector with other parameters?

manpreet 2 years ago

I have a couple of suggestions for your approach.

DBSCAN expects its input as a 2D array (or sparse matrix), not as an array of tuples. Hence you have to flatten your input data so that each sample is a single row of numbers.

From Documentation:

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
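To illustrate the shape requirement, here is a minimal sketch using made-up toy data (the names `text_features` and `other_features` and all values are assumptions, not taken from the question). It shows that pairing the two feature sets per sample gives nested rows NumPy cannot turn into a numeric 2D array, while stacking them column-wise gives the `(n_samples, n_features)` shape.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical toy data: 3 samples, 4 text features and 2 other features.
text_features = csr_matrix(np.array([[0.0, 0.5, 0.0, 0.5],
                                     [0.7, 0.0, 0.3, 0.0],
                                     [0.0, 0.0, 1.0, 0.0]]))
other_features = np.array([[12.0, 3.0],
                           [11.0, 4.0],
                           [2.0, 9.0]])

# Pairing the features per sample gives rows of unequal-length parts,
# which cannot be converted to a numeric 2D array.
pairs = [(other_features[i], text_features[i].toarray().ravel())
         for i in range(other_features.shape[0])]
try:
    np.asarray(pairs, dtype=float)
    ragged_ok = True
except ValueError:
    ragged_ok = False

# Stacking column-wise gives one flat row per sample instead.
combined = hstack((text_features, other_features)).tocsr()
print(ragged_ok, combined.shape)
```

The stacked matrix stays sparse, so it can be passed on directly or converted with `.toarray()` when a dense array is needed.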

get_distance() has to return a single value, not an array. For the non-text features I would therefore suggest using a proper distance measure; the example below uses Euclidean distance.
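As a small sketch of the "single value" point (the vectors `a` and `b` are made up for illustration): scikit-learn's pairwise helpers return 2D arrays even for a single pair, so the scalar has to be indexed out. Note also that `cosine_similarity` is a similarity (1 means identical direction), whereas a distance metric should be 0 for identical inputs; `cosine_distances` returns `1 - similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

# Two identical toy vectors (hypothetical values).
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])

# The pairwise helpers work on 2D inputs and return a 2D result.
sim = cosine_similarity([a], [b])
print(sim.shape)

# Index into the result to obtain a plain scalar distance.
dist = cosine_distances([a], [b])[0, 0]
print(float(dist))
```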

Example:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
text_list = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document

import numpy as np
hashes_list = np.array([[12,12,12],
               [12,13,11],
               [12,1,16],
               [4,8,11]])

from scipy.sparse import hstack
# Put the text columns first so the slices in get_distance() line up.
combined_list = hstack((text_list, hashes_list))

from sklearn.metrics.pairwise import cosine_distances
from sklearn.metrics.pairwise import euclidean_distances

from sklearn.cluster import DBSCAN

n1 = text_list.shape[1]  # number of text (TF-IDF) columns

def get_distance(vec1,vec2):
    # Cosine distance (1 - cosine similarity) on the text columns.
    # The pairwise helpers return 2D arrays, so [0, 0] extracts the scalar.
    text_distance = cosine_distances([vec1[:n1]], [vec2[:n1]])[0, 0]
    # Euclidean distance on the remaining (non-text) columns.
    other_distance = euclidean_distances([vec1[n1:]], [vec2[n1:]])[0, 0]
    return (text_distance+other_distance)/2

db = DBSCAN(eps=1, min_samples=3, metric=get_distance ).fit(combined_list.toarray())
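As an alternative to a Python callable metric, which is called once per pair of samples, the combined distances can be computed up front and passed to DBSCAN with `metric="precomputed"`. The sketch below is one possible variant, not the approach from the question; it reuses the toy corpus and hash values from the example above, and the equal weighting of the two distances is an assumption carried over from `get_distance()`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
hashes = np.array([[12, 12, 12],
                   [12, 13, 11],
                   [12, 1, 16],
                   [4, 8, 11]])

text = TfidfVectorizer().fit_transform(corpus)

# Full pairwise distance matrices, each of shape (n_samples, n_samples).
text_dist = cosine_distances(text)
other_dist = euclidean_distances(hashes)

# Equal-weight average of the two distances (an assumed weighting).
distance_matrix = (text_dist + other_dist) / 2

# DBSCAN reads the matrix as distances when metric="precomputed".
labels = DBSCAN(eps=1, min_samples=3, metric="precomputed").fit_predict(distance_matrix)
print(labels)
```

Because the Euclidean part is unbounded while cosine distance on TF-IDF vectors lies between 0 and 1, it may be worth scaling the non-text features first so that one term does not dominate; `eps` then needs to be tuned for the resulting scale.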
