Deduplication and near-duplicates: a short guide

Aquib J. Khan
May 21, 2022

Hello everyone!

In this short blog post, I will talk about a few methods and “out of the box” solutions for dealing with duplicate and near-duplicate data.
So, why is this important? From the model training point of view, ML models do not generalize well when there is duplicate data in the training dataset.

Okay, so let’s start.

There are many ways to calculate the similarity between two sentences/documents; let’s have a look at the following:

  • Jaccard similarity (direct token matching): doesn’t perform well because it doesn’t capture context; even synonyms are a problem here (a sketch of Jaccard and TF-IDF cosine follows this list).
  • Vector-based similarity (cosine similarity, or any distance-based measure):
    * TF-IDF vectors
    * Different embedding vectors (transformers/fastText/word2vec, etc.)
  • We can leverage zero-shot models.
  • Train a transformer using different loss functions (e.g., triplet loss).
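
Here is a minimal sketch of the first two baselines (whitespace tokenization and the helper names are my own assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard_similarity(sent1, sent2):
    # size of the token intersection over the size of the token union
    tokens1, tokens2 = set(sent1.lower().split()), set(sent2.lower().split())
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)

def tfidf_cosine_similarity(sent1, sent2):
    # fit TF-IDF on the pair and compare the two resulting vectors
    vectors = TfidfVectorizer().fit_transform([sent1, sent2])
    return cosine_similarity(vectors[0], vectors[1])[0][0]

print(jaccard_similarity("I like to eat when I travel", "I like walking when I travel"))
print(tfidf_cosine_similarity("I like to eat when I travel", "I like walking when I travel"))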

Hashing

To find exact duplicates, we can hash each document (e.g., with MD5) and then group by the hash.

>>> import hashlib
>>> hashlib.md5("text to encode".encode()).hexdigest()
'07161b7ecf4ca0760c721d6b8b9badfb'
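
One way the “group by” step might look in plain Python (a minimal sketch; the toy docs list is just an assumption for illustration):

import hashlib

docs = ["text to encode", "another text", "text to encode"]

# group documents by their MD5 digest
groups = {}
for doc in docs:
    digest = hashlib.md5(doc.encode()).hexdigest()
    groups.setdefault(digest, []).append(doc)

# keep one document per digest to drop exact duplicates
deduplicated = [group[0] for group in groups.values()]
print(deduplicated)  # ['text to encode', 'another text']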

For images: https://github.com/idealo/imagededup

Now let’s take a quick look at finding near-duplicate/similar documents with the “out of the box” solutions.

!pip install transformers
!pip install sentence-transformers

After installation:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('bert-base-nli-mean-tokens')

def check_similarity(sent1, sent2):
    # encode both sentences into dense vectors
    sent_vec_1 = model.encode(sent1)
    sent_vec_2 = model.encode(sent2)
    sent_score = cosine_similarity([sent_vec_1], [sent_vec_2]).tolist()[0][0]
    if sent_score > 0.5:
        print(f"The following: {sent1} and {sent2} are :: SIMILAR")
    else:
        print(f"The following: {sent1} and {sent2} are :: DISSIMILAR")

sentence_1 = "I like to eat when I travel"
sentence_2 = "I like walking when I travel"
sentence_3 = "I like to play cricket"
check_similarity(sent1=sentence_1, sent2=sentence_2)
check_similarity(sent1=sentence_1, sent2=sentence_3)

The following: I like to eat when I travel and I like walking when I travel are :: SIMILAR
The following: I like to eat when I travel and I like to play cricket are :: DISSIMILAR

The above was simple transformer embeddings plus cosine similarity. To improve the quality, we can train the model with a triplet loss function to get better embeddings.
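
As a rough sketch of what that training could look like with sentence-transformers’ TripletLoss (the single toy triplet is a placeholder, not real training data):

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('bert-base-nli-mean-tokens')

# each example is (anchor, positive, negative)
train_examples = [
    InputExample(texts=["I like to eat when I travel",
                        "I enjoy eating while travelling",
                        "I like to play cricket"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)
train_loss = losses.TripletLoss(model=model)

# pulls anchors closer to positives than to negatives by a margin
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)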

We can leverage zero-shot models as well:

from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

sentence_1 = "I like to eat when I travel"
sentences_to_match = ["I like walking when I travel", "I like to play cricket"]
print(classifier(sentence_1, sentences_to_match))
-------------------------------------------------
{'labels': ['I like walking when I travel', 'I like to play cricket'], 'scores': [0.6023975610733032, 0.397602379322052],
'sequence': 'I like to eat when I travel'}
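
To turn those scores into a duplicate decision, one option is a simple threshold (a sketch; the 0.5 cutoff is an assumption, and note that with the pipeline’s default settings the scores are normalized across the candidate list):

result = classifier(sentence_1, sentences_to_match)
for label, score in zip(result["labels"], result["scores"]):
    verdict = "SIMILAR" if score > 0.5 else "DISSIMILAR"
    print(f"{sentence_1} vs {label} :: {verdict} ({score:.2f})")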

Those were the two “out of the box” methods for similarity; with some fine-tuning, they can perform well.
