PyConDE & PyData Berlin 2019

Where Linguistics meets Natural Language Processing

Mariana Capinel

Wednesday 14:00 in Saal 6

Type/Track: Talk, PyData

For anyone working with Natural Language Processing/Understanding (NLP/NLU), I see a lot of value in a formal understanding of how languages are structured, beyond just being able to speak and understand them. In this talk I will give a simple explanation of the basic linguistic concepts and their connection to the NLP/NLU world.

According to traditional linguistics, there are 5 levels of study for languages: phonetics-phonology, morphology, syntax, semantics and pragmatics. They go from the smallest unit in language, the human sounds, to the largest, language usage. We will go through all of them in the talk.

In NLP/NLU we use models for different tasks, e.g., language understanding, topic modelling, sentiment analysis and chatbots. One of those models is the popular word2vec, which produces word embeddings. Each word’s embedding, or representation, is generated from the word’s context, that is, the set of nearby words. Pragmatics comes into play because the representation of a word is given by its context, and semantics because each embedding represents the literal meaning of a word.

As we can see in this example, word embeddings use multiple linguistic concepts to analyze words. Combining many contextual word mappings is a pragmatics-based approach; the approach is also semantics-based, because each mapping represents the literal interpretation of the word.
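The idea of a word's context as its set of nearby words can be sketched in a few lines of plain Python. This is only an illustration of the context-window step, not word2vec itself, which goes on to train a model on such pairs. The function names `context_pairs` and `contexts_by_word`, the `window` parameter and the sample sentence are all assumptions made for this sketch.

```python
from collections import defaultdict


def context_pairs(tokens, window=2):
    """Yield (center, context) pairs for tokens at most `window` positions apart.

    Hypothetical helper for illustration; not part of any library.
    """
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


def contexts_by_word(tokens, window=2):
    """Collect, for each word, the set of words seen in its context window."""
    contexts = defaultdict(set)
    for center, neighbour in context_pairs(tokens, window):
        contexts[center].add(neighbour)
    return dict(contexts)


# Assumed toy sentence, tokenized naively on whitespace.
sentence = "the cat sat on the mat".split()
pairs = list(context_pairs(sentence, window=1))
contexts = contexts_by_word(sentence, window=1)

for word, neighbours in contexts.items():
    print(word, sorted(neighbours))
```

Words that share many neighbours end up with similar context sets, which is the kind of signal an embedding model can learn from. A larger `window` widens what counts as context.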

Once we understand which language layer we need to work in to reach our goal, it becomes easier to recognize the right tool for each task. This talk introduces each of these layers, so that a new data scientist can better navigate the NLP ecosystem.

Tags: Natural Language Processing

Level: Domain Expertise: none; Python Skill Level: none

Mariana Capinel

Data scientist with background in Linguistics.