Duplicate content is a common problem on many online platforms. In classifieds it requires special attention: fraudsters create fake ads by copying existing listings and trick honest users into giving away their money.

In this talk, we suggest an approach to this problem and show how to design a system for duplicate detection in online classifieds. We cover it end to end: first we present the basic conceptual ideas, then we go into hands-on implementation details.

In the conceptual part, we talk about the general approach to the duplicate detection problem. We explain how to use both images and text to identify potential duplicates, and how to apply machine learning to make sure the results are accurate.
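As a rough illustration of combining image and text signals, the sketch below turns a pair of listings into similarity features that a machine learning model could then score. It is a minimal sketch under stated assumptions, not the method presented in the talk: word-shingle Jaccard similarity and a simple difference hash are stand-ins chosen to keep the example dependency-free, and the names `shingles`, `dhash_bits` and `pair_features`, as well as the `text` and `pixels` fields, are hypothetical.

```python
def shingles(text, n=3):
    """Return the set of word n-grams of a lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def dhash_bits(pixels):
    """Difference hash: one bit per pair of horizontally adjacent pixels.

    `pixels` is assumed to be a small grayscale grid (list of rows),
    already resized to a fixed shape.
    """
    return [int(left > right)
            for row in pixels
            for left, right in zip(row, row[1:])]


def bit_similarity(bits_a, bits_b):
    """Fraction of positions where two equal-length bit lists agree."""
    if len(bits_a) != len(bits_b) or not bits_a:
        raise ValueError("hashes must be non-empty and of equal length")
    return sum(x == y for x, y in zip(bits_a, bits_b)) / len(bits_a)


def pair_features(ad_a, ad_b):
    """Similarity features for a candidate pair; a classifier trained on
    labelled pairs (not shown) would turn these into a duplicate score."""
    return {
        "text_jaccard": jaccard(shingles(ad_a["text"]),
                                shingles(ad_b["text"])),
        "image_similarity": bit_similarity(dhash_bits(ad_a["pixels"]),
                                           dhash_bits(ad_b["pixels"])),
    }
```

In this sketch, a listing compared with an exact copy of itself yields 1.0 for both features, while unrelated listings score lower on both.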

To successfully stop fraudsters, a duplicate detection system has to process millions of items daily while meeting very strict speed requirements. This is why, in the implementation part, we discuss how to design the system to sustain a high load of 10 million listings daily while always keeping the response time under one second. We show how to build it using Python, AWS, Elasticsearch, Keras and other libraries.
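To put the load figures above in perspective, the short calculation below converts 10 million listings per day into an average request rate and, using Little's law, into an upper bound on the average number of requests in flight when each one finishes in under a second. It assumes arrivals spread evenly over the day, which real traffic will not do, so peaks would be higher; the constant and function names are hypothetical.

```python
LISTINGS_PER_DAY = 10_000_000       # load figure from the abstract
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
MAX_RESPONSE_SECONDS = 1.0          # latency bound from the abstract


def average_rate(listings_per_day=LISTINGS_PER_DAY):
    """Average listings per second, assuming a perfectly even spread."""
    return listings_per_day / SECONDS_PER_DAY


def max_average_in_flight(rate, max_latency=MAX_RESPONSE_SECONDS):
    """Little's law: average in-flight requests = rate * latency,
    so the latency bound gives an upper bound on that average."""
    return rate * max_latency


rate = average_rate()
print(f"average rate: {rate:.1f} listings/s")
print(f"average in flight: under {max_average_in_flight(rate):.0f} requests")
```

The average works out to roughly 116 listings per second, so the system has to be sized for some multiple of that to absorb daily peaks.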

Alexey Grigorev

Affiliation: OLX

Alexey is an experienced Software Engineer with a focus on Machine Learning. He currently works at OLX Group as a Senior Data Scientist, where he mostly deals with content moderation and image models. He has been doing software engineering professionally for more than 10 years, 6 of which he has spent working with Machine Learning.

Alexey has written a couple of books, including Mastering Java for Data Science, and has successfully participated in data science competitions.

Visit the speaker at: GitHub, Homepage