Deduplication and near-duplicate detection: a short guide

  • Jaccard similarity (direct token matching): performs poorly because it does not capture context. Even synonyms are a problem, since they count as mismatches.
  • Vector-based similarity (cosine similarity or any other distance measure):
    * TF-IDF vectors
    * Embedding-based vectors (transformers, fastText, word2vec, etc.)
  • Zero-shot models can also be leveraged.
  • Train a transformer with a suitable loss function (e.g., triplet loss).
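
To make the first point concrete, here is a minimal sketch of token-level Jaccard similarity. It assumes plain lowercase whitespace tokenization; the function name `jaccard_similarity` and the example sentences are illustrative and not taken from any library. It shows why token overlap misleads: a paraphrase built from synonyms scores low, while two sentences with different meanings but shared tokens score high.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the two texts' token sets: |A & B| / |A | B|."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical (an assumption)
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Same meaning, different wording: only "the" and "is" are shared (2 of 6 tokens).
paraphrase_score = jaccard_similarity("the movie is great", "the film is excellent")

# Different meaning, nearly the same tokens (5 of 7 tokens shared).
lexical_score = jaccard_similarity("I like to eat when I travel",
                                   "I like to sleep when I travel")

print(f"paraphrase: {paraphrase_score:.2f}, lexical overlap: {lexical_score:.2f}")
```

The paraphrase pair shares 2 of 6 distinct tokens and the second pair shares 5 of 7, so the measure ranks the pairs in the opposite order from their meaning.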


>>> # Exact duplicates: identical text always produces the same MD5 digest
>>> import hashlib
>>> hashlib.md5("text to encode".encode()).hexdigest()
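
The digest above can serve as a key for removing exact duplicates from a collection. A minimal sketch, assuming that light normalization (lowercasing and collapsing whitespace) is acceptable before hashing; `text_digest` and `drop_exact_duplicates` are hypothetical helper names, not part of `hashlib`.

```python
import hashlib


def text_digest(text: str) -> str:
    """MD5 hex digest of the text after light normalization."""
    # The normalization step is an assumption; remove it to match
    # byte-identical texts only.
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct digest, preserving order."""
    seen: set[str] = set()
    unique = []
    for text in texts:
        digest = text_digest(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = ["Text to encode", "text  to encode", "another text"]
unique_docs = drop_exact_duplicates(docs)
print(unique_docs)
```

Hashing only catches texts that are identical after normalization; anything reworded needs one of the similarity-based approaches.
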
!pip install transformers
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the sentence-embedding model once and reuse it for every comparison
model = SentenceTransformer('bert-base-nli-mean-tokens')

def check_similarity(sent1, sent2):
    # Encode each sentence into a fixed-size vector
    sent_vec_1 = model.encode(sent1)
    sent_vec_2 = model.encode(sent2)
    # cosine_similarity returns a 1x1 matrix; extract the single score
    sent_score = cosine_similarity([sent_vec_1], [sent_vec_2]).tolist()[0][0]
    if sent_score > 0.5:
        print(f"The following: {sent1} and {sent2} are :: SIMILAR")
    else:
        print(f"The following: {sent1} and {sent2} are :: DISSIMILAR")

sentence_1 = "I like to eat when I travel"
sentence_2 = "I like walking when I travel"
check_similarity(sent1=sentence_1, sent2=sentence_2)
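
The pairwise check extends naturally to a whole collection. Below is a minimal sketch, assuming the embeddings have already been computed (for instance with `model.encode`); tiny hand-made vectors stand in for real embeddings so the example runs without any model. `cosine` and `drop_near_duplicates` are hypothetical helper names, the 0.9 threshold is an assumption to tune per dataset, and the greedy keep-first strategy is one simple choice among several.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def drop_near_duplicates(texts, vectors, threshold=0.9):
    """Greedily keep a text only if it is not too similar to any text kept so far."""
    kept_indices = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) <= threshold for j in kept_indices):
            kept_indices.append(i)
    return [texts[i] for i in kept_indices]


# Toy 2-D vectors standing in for sentence embeddings (illustrative values only).
texts = ["doc A", "doc A (reworded)", "doc B"]
vectors = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]
kept = drop_near_duplicates(texts, vectors)
print(kept)
```

This compares every candidate against every kept item, so the cost grows quadratically with the collection size; it is meant for small collections or as a baseline.
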
from transformers import pipeline
# The model argument is missing here; without one, pipeline() loads its default zero-shot model
classifier = pipeline("zero-shot-classification")

sentence_1 = "I like to eat when I travel"
# Each candidate sentence is passed as a label and scored against sentence_1
sentences_to_match = ["I like walking when I travel", "I like to play cricket"]
print(classifier(sentence_1, sentences_to_match))
# Output:
# {'labels': ['I like walking when I travel', 'I like to play cricket'], 'scores': [0.6023975610733032, 0.397602379322052],
# 'sequence': 'I like to eat when I travel'}




Aquib J. Khan

