How to combine text vector parameter with other parameters before feeding it to sklearn?

Posted on 16 Aug 2022 in General Tech, Bugs & Fixes.

manpreet 2 years ago

I'm trying to combine two types of parameters before clustering.

My parameters are text, represented as a sparse matrix, and a second array representing the other features of each data point.

I tried combining the two types of parameters into a single array and passing it as input to the algorithm:

db = DBSCAN(eps=1, min_samples=3, metric=get_distance).fit(array(combined_list))

I've also built a custom distance metric that I'm going to use:

def get_distance(vec1,vec2):
    text_distance = cosine_similarity(vec1[0] ,vec2[0])
    other_distance = vec1[1]-vec2[1]

    return (text_distance+other_distance)/2

But I get an error when I pass my input array. The combined array was constructed as follows:

combined_list = []
for i in range(len(hashes_list)):
    combined_list.append((hashes_list[i],text_list[i]))

combined_list = array(combined_list)

Full Error Traceback:

db = DBSCAN(eps=1, min_samples=3, metric=get_distance ).fit(array(combined_list))

Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "", line 1, in <module>
  File "/Users/tal/src/campaign_detection/Data_Extractor/venv/lib/python3.7/site-packages/sklearn/cluster/dbscan_.py", line 319, in fit
    X = check_array(X, accept_sparse='csr')
  File "/Users/tal/src/campaign_detection/Data_Extractor/venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 527, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "/Users/tal/src/campaign_detection/Data_Extractor/venv/lib/python3.7/site-packages/numpy/core/numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

Is this the correct approach for combining a text vector with other parameters?

manpreet 2 years ago

I have a couple of suggestions for your approach.

  1. DBSCAN expects a 2D array as input, not an array of tuples. Hence you have to flatten your input data into a single feature matrix with one row per sample.

From Documentation:

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)

  2. get_distance() has to return a single value, not an array. I would also suggest using a proper distance measure for the non-text features. I have given an example that uses Euclidean distance.
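To illustrate the first point, here is a minimal sketch of why the conversion in the traceback fails and what a flattened input looks like. The data is made up: `hashes` and `text_vectors` are hypothetical stand-ins (dense instead of sparse, for brevity), not the asker's real variables.

```python
import numpy as np

# Hypothetical stand-ins: one scalar feature and one text vector per sample.
hashes = [1.0, 2.0, 3.0]
text_vectors = [np.array([0.1, 0.9]),
                np.array([0.5, 0.5]),
                np.array([0.9, 0.1])]

# Pairing them per sample gives an object array whose cells hold whole vectors.
pairs = np.empty((3, 2), dtype=object)
for i in range(3):
    pairs[i, 0] = hashes[i]
    pairs[i, 1] = text_vectors[i]

# sklearn's check_array tries this kind of conversion to a numeric array;
# a cell that contains a sequence cannot become a single float.
try:
    np.asarray(pairs, dtype=np.float64)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Flattened alternative: one row per sample, one numeric column per feature.
flat = np.column_stack([hashes, np.vstack(text_vectors)])
print(conversion_failed, flat.shape)
```

With sparse text vectors, scipy.sparse.hstack plays the role of np.column_stack here.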

Example:

import numpy as np
from scipy.sparse import hstack
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.metrics.pairwise import euclidean_distances

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
text_list = vectorizer.fit_transform(corpus)

hashes_list = np.array([[12, 12, 12],
                        [12, 13, 11],
                        [12, 1, 16],
                        [4, 8, 11]])

# Text columns go first, so the first n1 columns of every row are the text vector
combined_list = hstack((text_list, hashes_list))

# Number of text features (columns produced by the vectorizer)
n1 = text_list.shape[1]

def get_distance(vec1, vec2):
    # Use cosine distance (1 - cosine similarity), so that smaller means closer;
    # [0, 0] pulls the scalar out of the 1x1 result
    text_distance = cosine_distances([vec1[:n1]], [vec2[:n1]])[0, 0]
    other_distance = euclidean_distances([vec1[n1:]], [vec2[n1:]])[0, 0]
    return (text_distance + other_distance) / 2

db = DBSCAN(eps=1, min_samples=3, metric=get_distance).fit(combined_list.toarray())
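The documentation quoted above also allows an array of shape (n_samples, n_samples), which suggests an alternative: compute the combined distance matrix up front and pass metric="precomputed". The following is only a sketch of that idea. The rescaling of the Euclidean part to [0, 1] and the values of eps and min_samples are my own assumptions for this toy data, not something from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
other = np.array([[12, 12, 12],
                  [12, 13, 11],
                  [12, 1, 16],
                  [4, 8, 11]], dtype=float)

text = TfidfVectorizer().fit_transform(corpus)

# Pairwise distance matrices, each of shape (n_samples, n_samples)
text_dist = cosine_distances(text)
other_dist = euclidean_distances(other)

# Assumption: rescale the Euclidean part to [0, 1] so it does not
# dominate the cosine part, which is already bounded for TF-IDF vectors
max_other = other_dist.max()
if max_other > 0:
    other_dist = other_dist / max_other

combined = (text_dist + other_dist) / 2

# eps and min_samples are illustrative values only
db = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit(combined)
labels = db.labels_
print(labels)
```

This avoids calling a Python function for every pair of rows, and it keeps the text matrix sparse, at the cost of holding the full n_samples x n_samples matrix in memory.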
