Transforming Text into Data Structures

Text data poses a distinctive challenge: it has no direct numerical representation, yet computers only understand numbers. Representing text using numbers is therefore difficult. At the same time, it is an opportunity to invent and try out approaches to represent text so that as much information as possible is captured in the process. In this chapter, we will look at how text and math interface. Let's take baby steps toward transforming text data into mathematical data structures, which will provide insight into how to actually represent text using numbers and, consequently, build Natural Language Processing (NLP) models.

Pause for a moment here and think about how you would try to solve this problem.

By the end of this chapter, we will be better equipped to handle text data, having understood techniques including count vectorization and term frequency-inverse document frequency (TF-IDF) vectorization, among others.

Before we discuss approaches such as count vectors and TF-IDF vectors in this chapter, and further approaches such as Word2vec in future chapters, we need to understand two fundamental concepts that underpin every language: syntax and semantics. Syntax defines the grammatical structures, or the set of rules, of a language. It can be thought of as a set of guiding principles that define how words can be placed next to each other to form sentences or phrases. However, syntactically correct sentences may not be meaningful. Semantics is the part that takes care of meaning and defines how to put words together so that they actually make sense when organized according to the available syntactical rules.

In this chapter, we will primarily focus on the syntactical aspects, where we use information such as how many times a word occurred in a document or in a set of documents as potential features to represent documents. Let's see how these approaches pan out in solving the representation problem we have.
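To make the idea of word occurrences as features concrete, here is a minimal sketch using only the Python standard library. The helper name `word_counts` and the naive whitespace tokenization are assumptions made for illustration, not the approach developed later in the chapter.

```python
from collections import Counter


def word_counts(document):
    """Map each lowercase token in a document to its number of occurrences."""
    # Naive whitespace tokenization: an assumption made for this sketch only.
    return Counter(document.lower().split())


# Two hypothetical documents with the same words in a different order.
docs = ["the dog chased the cat", "the cat chased the dog"]
counts = [word_counts(doc) for doc in docs]

print(counts[0]["the"])        # how many times "the" occurs in the first document
print(counts[0] == counts[1])  # whether both documents have identical counts
```

The two documents end up with identical counts even though they describe different events, which hints at what a purely count-based representation cannot capture about meaning.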

The following topics will be covered in this chapter:

  • Understanding vectors and matrices
  • Exploring the Bag-of-Words (BoW) architecture
  • TF-IDF vectors
  • Distance/similarity calculation between document vectors
  • One-hot vectorization
  • Building a basic chatbot