QUESTION
A data scientist is working on a machine learning model to improve relevance in search results. As a benchmark, the scientist is using SKLearn’s TF-IDF implementation.
However, at first glance, the tf-idf values don't look as expected. Which of the following best explains why the values might differ from the scientist's initial calculations? And, given the following corpus, what are the dimensions of the tf-idf matrix if only bigrams are selected?
Document 1: product managers know about machine learning
Document 2: machine learning, a product owner essential
(Choose One)
Sklearn's TfidfVectorizer does not use the textbook idf(t): both the default smoothed idf and the unsmoothed idf add 1 to the logarithm, and L2 normalisation is applied to each row by default. The code example below shows the defaults; if you expect idf to be the usual ln(number of docs / docs containing the term), you would be expecting some zeros, but sklearn produces none.
There are 2 documents (sentences) and 8 possible bigrams (listed below), with 'a' excluded because it is shorter than the default 2-character token minimum (which is why 'learning product' appears as a bigram instead of 'learning a' and 'a product'). So the resulting tf-idf matrix is 2 × 8.
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
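For reference, the two idf variants sklearn actually uses (documented at the link above) can be written as small helpers. This is a minimal sketch where n is the number of documents and df is the number of documents containing the term; the function names are purely illustrative:
import math

def idf_smoothed(n, df):    # default behaviour, smooth_idf=True
    return math.log((1 + n) / (1 + df)) + 1

def idf_unsmoothed(n, df):  # smooth_idf=False
    return math.log(n / df) + 1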
Example Code –
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["product managers know about machine learning",
          "machine learning, a product owner essential"]

vectorizer_A = TfidfVectorizer(ngram_range=(2, 2))
vectorizer_B = TfidfVectorizer(ngram_range=(2, 2), smooth_idf=False, norm=None)

matrix_A = vectorizer_A.fit_transform(corpus)
matrix_B = vectorizer_B.fit_transform(corpus)

print("**list of bigrams**")
for i, feature in enumerate(vectorizer_A.get_feature_names_out()):  # get_feature_names() in older sklearn
    print(i, feature)

print("**bigram tf-idf with default values**")
print(matrix_A)
print("**bigram tf-idf with no smoothing or normalisation**")
print(matrix_B)
**list of bigrams**
0 about machine
1 know about
2 learning product
3 machine learning
4 managers know
5 owner essential
6 product managers
7 product owner
**bigram tf-idf with default values**
(0, 3) 0.33517574332792605
(0, 0) 0.47107781233161794
(0, 1) 0.47107781233161794
(0, 4) 0.47107781233161794
(0, 6) 0.47107781233161794
(1, 5) 0.534046329052269
(1, 7) 0.534046329052269
(1, 2) 0.534046329052269
(1, 3) 0.37997836159100784
**bigram tf-idf with no smoothing or normalisation**
(0, 3) 1.0
(0, 0) 1.6931471805599454
(0, 1) 1.6931471805599454
(0, 4) 1.6931471805599454
(0, 6) 1.6931471805599454
(1, 5) 1.6931471805599454
(1, 7) 1.6931471805599454
(1, 2) 1.6931471805599454
(1, 3) 1.0
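As a quick sanity check (a minimal sketch using only the standard library), the unsmoothed values above are ln(2/1) + 1 ≈ 1.6931 for bigrams that appear in one document and ln(2/2) + 1 = 1.0 for machine learning, which appears in both; the default output applies the smoothed idf and then L2-normalises each row:
import math

n = 2  # number of documents in the corpus

# smoothed idf values (the TfidfVectorizer defaults)
idf_rare = math.log((1 + n) / (1 + 1)) + 1    # bigram in one document, ~1.4055
idf_common = math.log((1 + n) / (1 + 2)) + 1  # 'machine learning' in both, = 1.0

# document 0 has 5 bigrams: 4 unique to it plus 'machine learning'; L2-normalise the row
norm = math.sqrt(4 * idf_rare ** 2 + idf_common ** 2)
print(idf_rare / norm)    # ~0.4711, matches the default output above
print(idf_common / norm)  # ~0.3352

# unsmoothed, unnormalised values (vectorizer_B)
print(math.log(n / 1) + 1)  # ~1.6931
print(math.log(n / 2) + 1)  # = 1.0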
3 Responses
How can you use mean for imputation if the variable is categorical?
For this question…
A data scientist is working with a small dataset of 1000 rows and 6 features (labelled f1, f2 … f6)
Feature f2 is categorical and has about 80 entries with no value set. The data is going to be used to create a simple regression model.
How should the missing values be handled?
Drop rows with missing data
Use the mean for this feature for missing values
Use KNN to determine average values to replace missing values
Use deep learning to impute the missing values
Hello Sanket,
Many thanks for taking the time to provide feedback. Yes, you’re right and I’ve corrected the question.
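For completeness: since f2 is categorical, the mean is undefined, and the usual simple fix is most-frequent (mode) imputation. A minimal sketch with sklearn's SimpleImputer, using made-up column values purely for illustration:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy frame standing in for the real data; f2 is categorical with missing entries
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": ["red", np.nan, "blue", "red"],
})

# most-frequent (mode) imputation is valid for categorical columns,
# unlike the mean, which is undefined for categories
imputer = SimpleImputer(strategy="most_frequent")
df["f2"] = imputer.fit_transform(df[["f2"]]).ravel()
print(df)  # the missing entry becomes "red"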