Deduplication and near-duplicate detection: a short guide

  • Jaccard similarity (direct token matching): performs poorly for near-duplicate detection because it ignores context. Two sentences that say the same thing with different synonyms share few tokens and score low.
  • Vector-based similarity (cosine similarity or any other distance-based measure):
    * TF-IDF vectors
    * Learned embeddings (transformers, fastText, word2vec, etc.)
  • Zero-shot models: treat the candidate sentences as labels and score the query sentence against them.
  • Train a transformer with a similarity-oriented loss function (e.g., triplet loss).
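
To make the first point concrete, here is a minimal sketch of token-level Jaccard similarity. The function name and the lowercase, whitespace-based tokenization are assumptions made for illustration; a real pipeline would likely use a proper tokenizer.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity over lowercase, whitespace-separated tokens."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Same meaning, different wording: only "the" and "was" are shared.
paraphrase_score = jaccard_similarity("the film was great", "the movie was excellent")

# Different meaning, similar wording: four of seven distinct tokens are shared.
lookalike_score = jaccard_similarity("I like to eat when I travel", "I like walking when I travel")

print(round(paraphrase_score, 3))  # 0.333
print(round(lookalike_score, 3))   # 0.571
```

The paraphrase scores lower than the pair that merely looks alike, which is the weakness described above.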


>>> import hashlib
>>> hashlib.md5("text to encode".encode()).hexdigest()
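
The hash above can serve as a fingerprint for exact deduplication: two texts with the same digest are treated as the same document. The sketch below is one possible way to use it. The helper names and the normalization step (lowercasing and collapsing whitespace) are assumptions added for illustration, not part of the snippet above.

```python
import hashlib


def fingerprint(text: str) -> str:
    """MD5 hex digest of the text after simple normalization (an assumed step)."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(texts):
    """Keep the first occurrence of each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for text in texts:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


docs = ["Text to encode", "text   to encode", "another text"]
print(drop_exact_duplicates(docs))  # ['Text to encode', 'another text']
```

Hashing only catches texts that are identical after normalization. Anything that differs by even one word gets a different digest, which is why the similarity-based methods are needed for near duplicates.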
!pip install transformers
!pip install sentence-transformers

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pretrained sentence-embedding model.
model = SentenceTransformer('bert-base-nli-mean-tokens')

def check_similarity(sent1, sent2):
    # Encode both sentences and compare their embeddings.
    sent_vec_1 = model.encode(sent1)
    sent_vec_2 = model.encode(sent2)
    sent_score = cosine_similarity([sent_vec_1], [sent_vec_2]).tolist()[0][0]
    # 0.5 is the cut-off used here; tune it for your own data.
    if sent_score > 0.5:
        print(f"The following: {sent1} and {sent2} are :: SIMILAR")
    else:
        print(f"The following: {sent1} and {sent2} are :: DISSIMILAR")

sentence_1 = "I like to eat when I travel"
sentence_2 = "I like walking when I travel"
check_similarity(sent1=sentence_1, sent2=sentence_2)
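
A pairwise check like this can be extended to a whole list of texts. The following is a rough sketch of a greedy filter, not a definitive implementation: `drop_near_duplicates` and `token_overlap` are hypothetical names, and the token-overlap function is only a stand-in so the example runs without downloading a model. In practice you would pass a function that wraps the embedding and cosine-similarity steps.

```python
from typing import Callable, List


def token_overlap(a: str, b: str) -> float:
    """Stand-in similarity (token Jaccard) so the sketch runs without a model."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def drop_near_duplicates(
    texts: List[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Greedy filter: keep a text only if no already-kept text exceeds the threshold."""
    kept: List[str] = []
    for text in texts:
        if all(similarity(text, other) <= threshold for other in kept):
            kept.append(text)
    return kept


docs = [
    "I like to eat when I travel",
    "I like walking when I travel",
    "The stock market fell today",
]
result = drop_near_duplicates(docs, token_overlap, threshold=0.5)
print(result)  # ['I like to eat when I travel', 'The stock market fell today']
```

The filter compares each text against every kept text, so the cost grows quadratically with the number of documents. That is fine for small collections but needs an index or blocking step for large ones.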
from transformers import pipeline

# With no model argument, the pipeline loads its default zero-shot checkpoint.
classifier = pipeline("zero-shot-classification")

sentence_1 = "I like to eat when I travel"
sentences_to_match = ["I like walking when I travel", "I like to play cricket"]
# The candidate sentences are passed as labels; each one gets a score.
print(classifier(sentence_1, sentences_to_match))

# Output:
# {'labels': ['I like walking when I travel', 'I like to play cricket'],
#  'scores': [0.6023975610733032, 0.397602379322052],
#  'sequence': 'I like to eat when I travel'}
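
For the triplet-loss option mentioned in the list of approaches, the objective itself is small enough to write out. This is a sketch of the loss on plain Python lists, assuming Euclidean distance and a margin of 1.0. Actual training would compute it on model embeddings inside a deep-learning framework.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # distance 1 from the anchor
negative = [3.0, 4.0]  # distance 5 from the anchor

# The positive is already closer than the negative by more than the margin: no loss.
print(triplet_loss(anchor, positive, negative))  # 0.0

# Roles swapped: the "positive" is far away and the "negative" is close, so the loss is large.
print(triplet_loss(anchor, negative, positive))  # 5.0
```

Minimizing this loss pulls duplicates (anchor and positive) together and pushes non-duplicates (anchor and negative) at least a margin apart, so a distance threshold becomes meaningful.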




Aquib J. Khan
