
Natural Language Processing with Python PDF Download


Date: 2020-09-17 09:26  Source: http://www.java1234.com  Author: 小锋


Download (compiled by this site):
 
 
Main content:

1.3 Computing with Language: Simple Statistics
Let’s return to our exploration of the ways we can bring our computational resources
to bear on large quantities of text. We began this discussion in Section 1.1, and saw
how to search for words in context, how to compile the vocabulary of a text, how to
generate random text in the same style, and so on.
In this section, we pick up the question of what makes a text distinct, and use automatic
methods to find characteristic words and expressions of a text. As in Section 1.1, you
can try new features of the Python language by copying them into the interpreter, and
you’ll learn about these features systematically in the following section.
Before continuing further, you might like to check your understanding of the last
section by predicting the output of the following code. You can use the interpreter
to check whether you got it right. If you’re not sure how to do this task, it would
be a good idea to review the previous section before continuing further.
>>> saying = ['After', 'all', 'is', 'said', 'and', 'done',
... 'more', 'is', 'said', 'than', 'done']
>>> tokens = set(saying)
>>> tokens = sorted(tokens)
>>> tokens[-2:]
what output do you expect here?
>>>
Frequency Distributions
How can we automatically identify the words of a text that are most informative about
the topic and genre of the text? Imagine how you might go about finding the 50 most
frequent words of a book. One method would be to keep a tally for each vocabulary
item, like that shown in Figure 1-3. The tally would need thousands of rows, and it
would be an exceedingly laborious process—so laborious that we would rather assign
the task to a machine.
Figure 1-3. Counting words appearing in a text (a frequency distribution).
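As a rough illustration (not part of the book’s text), the tally idea can be
sketched in a few lines of plain Python; tokens here stands in for any list of words:
>>> tokens = ['the', 'whale', 'the', 'sea', 'the', 'whale']
>>> counts = {}
>>> for word in tokens:
...     counts[word] = counts.get(word, 0) + 1   # one tally mark per occurrence
...
>>> sorted(counts.items(), key=lambda item: item[1], reverse=True)
[('the', 3), ('whale', 2), ('sea', 1)]
NLTK’s FreqDist, introduced next, automates exactly this bookkeeping.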
The table in Figure 1-3 is known as a frequency distribution, and it tells us the
frequency of each vocabulary item in the text. (In general, it could count any kind of
observable event.) It is a “distribution” since it tells us how the total number of word
tokens in the text are distributed across the vocabulary items. Since we often need
frequency distributions in language processing, NLTK provides built-in support for
them. Let’s use a FreqDist to find the 50 most frequent words of Moby Dick. Try to
work out what is going on here, then read the explanation that follows.
>>> fdist1 = FreqDist(text1) 
>>> fdist1 
<FreqDist with 260819 outcomes>
>>> vocabulary1 = fdist1.keys() 
>>> vocabulary1[:50] 
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-',
'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for',
'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on',
'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were',
'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
906
>>>
When we first invoke FreqDist, we pass the name of the text as an argument. We
can inspect the total number of words (“outcomes”) that have been counted up:
260,819 in the case of Moby Dick. The expression keys() gives us a list of all the
distinct types in the text, and we can look at the first 50 of these by slicing the list.
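A note for readers following along with a current NLTK release: this excerpt
describes the NLTK version the first edition was written against, where keys()
returned vocabulary items sorted by frequency. In NLTK 3, FreqDist is a subclass
of Python’s Counter, keys() is a dict-style view that is neither frequency-sorted
nor sliceable, and the idiomatic replacement is most_common(). A minimal sketch,
assuming NLTK 3 with the book corpora installed:
>>> from nltk.book import text1
>>> from nltk import FreqDist
>>> fdist1 = FreqDist(text1)
>>> fdist1.N()               # total word tokens counted
260819
>>> fdist1.most_common(3)    # (word, count) pairs, most frequent first
[(',', 18713), ('the', 13721), ('.', 6862)]
>>> fdist1['whale']
906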
Your Turn: Try the preceding frequency distribution example for yourself, for
text2. Be careful to use the correct parentheses and uppercase letters. If you get
an error message NameError: name 'FreqDist' is not defined, you need to start
your work with from nltk.book import *.
Do any words produced in the last example help us grasp the topic or genre of this text?
Only one word, whale, is slightly informative! It occurs over 900 times. The rest of the
words tell us nothing about the text; they’re just English “plumbing.” What proportion
of the text is taken up with such words? We can generate a cumulative frequency plot
for these words, using fdist1.plot(50, cumulative=True), to produce the graph in
Figure 1-4. These 50 words account for nearly half the book!
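To reproduce this end to end with a current NLTK (an assumption on the editor’s
part, not part of the excerpt; the plot call requires matplotlib, and
nltk.download('book') fetches the corpora on first use), and to check the
nearly-half claim numerically:
>>> import nltk
>>> nltk.download('book', quiet=True)
True
>>> from nltk.book import text1
>>> fdist1 = nltk.FreqDist(text1)
>>> top50 = fdist1.most_common(50)
>>> share = sum(count for word, count in top50) / fdist1.N()   # fraction of all tokens
>>> fdist1.plot(50, cumulative=True)   # draws the cumulative plot of Figure 1-4
The computed share should come out just under one half, in line with the text’s claim.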

 

