
Paco Nathan


Daniel Vila


Get Started with Natural Language Processing in Python

Morning | 09:00 - 13:30


Python provides a number of excellent packages for natural language processing (NLP) along with great ways to leverage the results. This course builds on spaCy, datasketch, word2vec, and other popular libraries.

If you are new to NLP, this course provides initial hands-on work and the confidence to explore much further into deep learning with text, natural language generation, chatbots, and related topics. First, however, we will show how to prepare text for parsing, extract key phrases, prepare text for indexing in search, and calculate similarity between documents. That provides a great starting point for developing custom search, content recommenders, and other applications of AI.

NB: This is an expanded, full-day version of a popular NLP course which I teach online.

Participants will understand…
-Benefits of using Python for NLP applications
-How statistical parsing works
-How resources such as WordNet enhance text mining
-How to extract more than just a list of keywords from a text
-How to summarize and compare a set of documents
-How deep learning gets used with natural language

Participants will be able to…
-Prepare texts for parsing, e.g., how to handle difficult Unicode
-Detect languages directly from the text
-Parse sentences into annotated lists, structured as JSON output
-Perform keyword ranking using TF-IDF, while filtering stop words
-Calculate a Jaccard similarity measure to compare texts
-Apply graph algorithms on a document corpus
-Leverage probabilistic data structures to perform calculations more efficiently
-Develop and leverage a knowledge graph related to a document corpus
-Use Jupyter notebooks for sharing all of the above within teams
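
Two of the skills listed above, keyword ranking with TF-IDF while filtering stop words and comparing texts with a Jaccard similarity measure, can be sketched in plain Python. This is a minimal illustration and not the course material: the stop-word list, the sample sentences, and the helper names (tokenize, jaccard, tf_idf) are assumptions made for this example, and the course itself builds on libraries such as spaCy and datasketch rather than hand-written code.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "with", "on"}


def tokenize(text):
    """Lowercase the text, keep runs of ASCII letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Uses relative term frequency times log(N / document frequency),
    which is one common TF-IDF variant among several.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


# Made-up sample sentences standing in for a document corpus.
docs = [
    tokenize("The cat sat on the mat"),
    tokenize("The dog sat on the log"),
    tokenize("Parsing text with Python is fun"),
]

similarity = jaccard(docs[0], docs[1])  # shared: {"sat"}; union has 5 terms
scores = tf_idf(docs)

print(similarity)                                # 0.2
print(sorted(scores[0], key=scores[0].get, reverse=True))
```

In the first document, "sat" ranks below "cat" and "mat" because it also appears in the second document, which is the effect TF-IDF is meant to capture. For large corpora, exact set comparison like this becomes expensive, which is where probabilistic data structures come in.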

Common Misunderstandings:
-How keyword analysis, n-grams, co-occurrence, stemming, and other techniques from a previous generation of NLP tools are no longer the best approaches to use.
-Whether NLP requires Big Data tooling and use of clusters; instead, we’ll show practical applications on a laptop.
-That NLP work leading into AI applications is either fully automated or something which requires a huge amount of manual work; instead, we’ll demonstrate “human-in-the-loop” practices that make the best of both people skills and automation.


Prerequisites:

·Some programming in Python (we’ll use Python 3) – for example, be comfortable with the material in Introduction to Python
·Basic understanding of HTML and the DOM structure for web pages – for example, be comfortable with the material in Modern Web Development with HTML5 and CSS
·Access to a computer with a browser, where you can install Python packages and develop code at a command line; in some cases, it may help to use virtualenv
·Know how to install Python libraries using pip, etc.
·Basic familiarity with Git and use of GitHub – see Introducing GitHub if needed

Downloads required in advance of the course:

·Python 3
·Install Jupyter
·Install BeautifulSoup4, TextBlob, spaCy, datasketch, gensim, networkx, PyTextRank

Everyone who registers for this course will receive a GitHub link with detailed instructions for setup, Jupyter notebooks for each of the course exercises, and a Docker container with all of the required libraries and data sets pre-loaded.


Big Data Spain will issue the certificate for this course


This course is for you because…

·You are a Python programmer and need to learn how to use available NLP packages
·You are a data scientist with some Python experience and need to leverage NLP and text mining
·You are interested in deep learning, chatbots, knowledge graphs, and related AI work, and want to understand the basics for preparing text data for those kinds of use cases

Bio of the instructor:

Paco Nathan is Director of the Learning Group at O'Reilly Media. Known as a "player/coach" data scientist, he has led innovative data teams building large-scale apps for several years. A recognized expert in distributed systems, machine learning, and enterprise data workflows, Paco is also an advisor for Amplify Partners. He has 30+ years of technology industry experience, ranging from Bell Labs to early-stage start-ups.

Bio of the instructor:

Daniel Vila is co-founder of a Madrid-based startup and spin-off from the Technical University of Madrid, building next-generation solutions for text analytics and content management using the latest AI techniques. Daniel holds a PhD in Artificial Intelligence from the Technical University of Madrid (2016) and has built one of the largest knowledge graphs in Spain, combining NLP and semantic technologies to power the data service of the National Library of Spain. He also received the Fujitsu Laboratories of Europe Innovation Award in 2014 and is a regular contributor to spaCy, one of the most advanced industrial libraries for NLP in Python.