Deduplication and near-duplicates: a short guide
Hello everyone!
In this short blog post, I will talk about a few methods and “out of the box” solutions to deal with duplicate and near-duplicate data.
So, why is this important? From the ML model training point of view, models do not generalize well when you have duplicate data in the training dataset.
Okay, so let’s start.
There are many ways to calculate the similarity between two sentences/documents; let’s have a look at the following:
- Jaccard similarity (direct token matching): doesn’t perform well, because it doesn’t capture context; even different synonyms for the same word are a problem here.
- Vector-based similarity (cosine similarity, any distance-based):
* TF-IDF vectors
* Vectors from pretrained models (transformers / fastText / word2vec, etc.); we can also leverage zero-shot models
- Train a transformer using a different loss function (e.g. the triplet loss) to get task-specific embeddings
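To make the first two ideas concrete, here is a minimal sketch (function names are my own) comparing Jaccard similarity on raw tokens with cosine similarity on TF-IDF vectors, using scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard_similarity(sent1, sent2):
    # Ratio of shared tokens to all unique tokens across both sentences
    tokens1, tokens2 = set(sent1.lower().split()), set(sent2.lower().split())
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)

def tfidf_cosine(sent1, sent2):
    # Fit TF-IDF on the pair, then compare the two resulting vectors
    vectors = TfidfVectorizer().fit_transform([sent1, sent2])
    return cosine_similarity(vectors[0], vectors[1])[0][0]

a = "I like to eat when I travel"
b = "I enjoy eating while travelling"
print(jaccard_similarity(a, b))  # low: almost no shared tokens
print(tfidf_cosine(a, b))        # also low: synonyms get no credit
```

Both scores come out near zero for these two sentences even though they mean roughly the same thing, which is exactly the synonym problem mentioned above.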
Hashing
To find exact duplicates, we can hash each document (e.g. with MD5) and then group by the digest.
>>> import hashlib
>>> hashlib.md5("text to encode".encode()).hexdigest()
'07161b7ecf4ca0760c721d6b8b9badfb'
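The “group by” step can be as simple as a dictionary keyed on the digest; a minimal sketch (the helper name is my own):

```python
import hashlib
from collections import defaultdict

def dedupe_exact(texts):
    # Group texts by their md5 digest, keep the first occurrence of each
    groups = defaultdict(list)
    for text in texts:
        digest = hashlib.md5(text.encode()).hexdigest()
        groups[digest].append(text)
    return [items[0] for items in groups.values()]

docs = ["text to encode", "text to encode", "another text"]
print(dedupe_exact(docs))  # ['text to encode', 'another text']
```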
For images: https://github.com/idealo/imagededup
Let’s have a quick look at finding near-duplicate/similar documents with the “out of the box” solutions.
!pip install transformers
!pip install sentence-transformers
After installation:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('bert-base-nli-mean-tokens')

def check_similarity(sent1, sent2):
    # Encode both sentences and compare them with cosine similarity
    sent_vec_1 = model.encode(sent1)
    sent_vec_2 = model.encode(sent2)
    sent_score = cosine_similarity([sent_vec_1], [sent_vec_2]).tolist()[0][0]
    if sent_score > 0.5:
        print(f"The following: {sent1} and {sent2} are :: SIMILAR")
    else:
        print(f"The following: {sent1} and {sent2} are :: DISSIMILAR")

sentence_1 = "I like to eat when I travel"
sentence_2 = "I like walking when I travel"
sentence_3 = "I like to play cricket"
check_similarity(sent1=sentence_1, sent2=sentence_2)
check_similarity(sent1=sentence_1, sent2=sentence_3)
The following: I like to eat when I travel and I like walking when I travel are :: SIMILAR
The following: I like to eat when I travel and I like to play cricket are :: DISSIMILAR
The above used plain transformer embeddings and cosine similarity. To improve quality, we can train the model using the triplet loss function to get better embeddings.
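For intuition, the triplet loss pulls an anchor towards a positive (near-duplicate) example and pushes it away from a negative one. A minimal numpy sketch of the loss itself (the margin value and function name are my own; real training would run this over a model’s embeddings):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # L = max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([1.0, 0.0])
p = np.array([1.0, 0.1])   # near-duplicate: small distance to the anchor
n = np.array([0.0, 1.0])   # unrelated: large distance to the anchor
print(triplet_loss(a, p, n))  # 0.0: positive is already much closer
```

When the positive is closer than the negative by more than the margin, the loss is zero and there is nothing left to learn from that triplet.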
We can leverage zero-shot classification as well:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentence_1 = "I like to eat when I travel"
sentences_to_match = ["I like walking when I travel", "I like to play cricket"]
print(classifier(sentence_1, sentences_to_match))
{'labels': ['I like walking when I travel', 'I like to play cricket'],
 'scores': [0.6023975610733032, 0.397602379322052],
 'sequence': 'I like to eat when I travel'}
Those were two “out of the box” methods for similarity; with some fine-tuning they can perform well.