Natural Language Processing (NLP) extracts information from textual data in a form that can power intelligent behavior across many surfaces, such as websites, apps, devices, and decision-making systems. NLP leverages the structure and coherence of language to create representations that are useful in modeling and prediction tasks.
In this presentation, we will talk about the NLP-based Machine Learning pipeline that we use at Chegg to extract knowledge from content and drive innovation in students' learning process.
The main components of the NLP and ML pipeline are weak supervision, transfer learning, active learning, and thresholding. The initial goal of the pipeline is to create a knowledge base with a hierarchy of concepts associated with content generated by students and instructors. Collecting training data to generate different parts of the knowledge base is a key bottleneck in developing NLP models, and employing subject matter experts to provide annotations is prohibitively expensive. Instead, we use weak supervision and active learning techniques, with tools such as Snorkel, an open-source project from Stanford, to make training data generation dramatically easier.
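To make the weak supervision idea concrete, here is a minimal sketch of how several noisy labeling functions can vote on unlabeled examples and be combined, in this case by simple majority vote. The labeling functions, label names, and example texts below are hypothetical illustrations, not Chegg's actual rules; Snorkel's LabelModel combines votes more carefully by learning per-function accuracies, but the workflow is the same.

```python
# Weak supervision sketch: noisy labeling functions (LFs) vote on each
# example; votes are combined by majority. All LFs and labels here are
# hypothetical stand-ins for domain heuristics.
from collections import Counter

ABSTAIN, MATH, BIOLOGY = -1, 0, 1

def lf_keyword_math(text):
    # Fire on math-specific vocabulary.
    return MATH if "integral" in text or "equation" in text else ABSTAIN

def lf_keyword_bio(text):
    # Fire on biology-specific vocabulary.
    return BIOLOGY if "cell" in text or "enzyme" in text else ABSTAIN

def lf_short_question(text):
    # Very short queries skew toward formula lookups: a weak, noisy signal.
    return MATH if len(text.split()) < 5 else ABSTAIN

LFS = [lf_keyword_math, lf_keyword_bio, lf_short_question]

def weak_label(text):
    """Combine LF votes; abstentions are dropped before the majority vote."""
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # no LF fired; the example stays unlabeled
    return Counter(votes).most_common(1)[0][0]

docs = [
    "Solve the integral of x squared",
    "How does the enzyme break down the cell wall?",
    "Quadratic equation roots",
]
weak_labels = [weak_label(d) for d in docs]
```

The payoff is that labeling functions encode expert knowledge once, then label arbitrarily many examples; the resulting noisy labels are good enough to train a downstream model.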
In the past few years, Deep Learning has provided an efficient way to build high-performance models without the need for manual feature engineering. But Deep Learning models typically require a huge amount of training data. One way to apply Deep Learning to small datasets is to reuse and retrain features learned in a different domain, a process known as Transfer Learning (TL). I will discuss the rapid development of TL for NLP in the past year, as well as our experience using both open-source and in-house TL models.
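The core mechanic of transfer learning can be sketched in a few lines: keep a pretrained feature extractor frozen and train only a small classification head on the limited labeled data. In the sketch below, a toy keyword-count featurizer stands in for a real pretrained language model, purely for illustration; the encoder, head, and data are all hypothetical.

```python
# Transfer-learning sketch: the "pretrained" encoder is frozen, and only a
# small logistic-regression head is trained on a handful of labeled examples.
# In practice the encoder would be a pretrained language model; this toy
# featurizer just illustrates the frozen-features / trainable-head split.
import math

def pretrained_encoder(text):
    """Frozen feature extractor (stand-in for a pretrained model)."""
    words = text.lower().split()
    return [
        sum(w in {"integral", "equation", "derivative"} for w in words),
        sum(w in {"cell", "enzyme", "protein"} for w in words),
        len(words) / 10.0,
    ]

def train_head(texts, labels, epochs=200, lr=0.5):
    """Train a logistic-regression head on frozen features via SGD."""
    dim = len(pretrained_encoder(texts[0]))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for text, y in zip(texts, labels):
            x = pretrained_encoder(text)   # encoder weights never update
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                      # gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, text):
    x = pretrained_encoder(text)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0  # 1 = biology, 0 = math (toy label scheme)

# A deliberately tiny labeled set: the point of TL is that the frozen
# features make even this little supervision sufficient.
train_texts = ["integral of x", "equation solving", "enzyme in a cell", "protein folding"]
train_labels = [0, 0, 1, 1]
w, b = train_head(train_texts, train_labels)
```

Because only the head's few parameters are learned, four labeled examples suffice here; the same division of labor is what lets pretrained language models be fine-tuned on small in-domain datasets.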
We will also touch upon how to integrate these models into the product, a key step in which is evangelizing these fairly technical ideas to key stakeholders at a high level.