Exploring, Describing, and Hashing the Text

Introduction

The field of text mining, or text analytics, is evolving rapidly: techniques and technologies are improving faster than many experts can keep up with. This chapter introduces the text analytics tools built into Azure ML Studio, though there are other tools you should explore as well. We'll use the Twitter dataset throughout this chapter. Let's begin with some basics:

Twitter dataset

Azure ML Studio offers several traditional descriptive analyses that are useful for getting to know your text. Follow along with each of these videos to learn how to describe and understand the text you want to analyze.

Detect Language and Preprocess Text

Just as with traditional numeric, ordinal, and nominal data, the first step in analyzing text data is to clean and normalize it. Follow along with the video below to learn how this is done in Azure ML Studio using the Detect Language pill and the Preprocess Text pill:

If you'd like a little more theory behind word normalization, lemmatization, or stemming, here's a decent (optional) video overview:
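Separately from the videos, it may help to see what cleaning and normalizing looks like in plain code. The Python sketch below is only an illustration of the general idea, not what the Preprocess Text pill does internally. The stop-word list, the suffix rules, and the function names `naive_stem` and `preprocess` are all assumptions made up for this example; a real stemmer or lemmatizer is far more careful.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "at", "for", "on"}


def naive_stem(word):
    """Strip a few common English suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three letters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase a tweet, drop URLs, mentions, and hashtags, then tokenize and stem."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)       # drop @mentions and #hashtags
    tokens = re.findall(r"[a-z]+", text)       # keep alphabetic tokens only
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("Loving the new phones at https://t.co/abc @store #deals"))
# prints ['lov', 'new', 'phon']
```

Notice that the stems ("lov", "phon") are not real words. That is typical of stemming, which chops suffixes by rule, whereas lemmatization tries to return the dictionary form of each word.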

Named Entity Recognition

Named Entity Recognition (pill) is another common task in text analytics. As before, first follow along with the video below to learn how it is accomplished in Azure ML Studio. Then, if you are interested, watch the next (optional) video to learn the theory behind Named Entity Recognition.

Extract Key Phrases (pill)

Feature Hashing (pill)
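
As a rough picture of the idea behind feature hashing (often called the hashing trick), the Python sketch below maps each token to one slot of a fixed-length vector by hashing it, so no vocabulary dictionary has to be stored. This is a sketch under stated assumptions, not the pill's actual implementation: the function name `hash_features` and the `n_bits` parameter are hypothetical, and MD5 is used here only because it gives the same result on every run.

```python
import hashlib


def hash_features(tokens, n_bits=4):
    """Map tokens into a count vector with 2**n_bits slots (hypothetical helper)."""
    size = 2 ** n_bits
    vector = [0] * size
    for token in tokens:
        # MD5 is an assumption chosen for repeatability, not for security.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        vector[index] += 1  # different tokens may collide in the same slot
    return vector


tweet_tokens = ["great", "phone", "great", "battery"]
features = hash_features(tweet_tokens, n_bits=4)
print(len(features), sum(features))
# prints 16 4
```

The vector length depends only on `n_bits`, never on how many distinct words appear in the data. The trade-off is collisions: with few bits, unrelated words can land in the same slot, so a larger bit size reduces collisions at the cost of a wider feature table.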