Data-Intensive Text Processing with MapReduce: PDF Download


Date: 2021-04-22 10:06  Source: http://www.java1234.com  Author: reposted


 
Main content:


CHAPTER 1. INTRODUCTION
everything from understanding political discourse in the blogosphere to predicting the
movement of stock prices.
There is a growing body of evidence, at least in text processing, that of the three
components discussed above (data, features, algorithms), data probably matters the
most. Superficial word-level features coupled with simple models in most cases trump
sophisticated models over deeper features and less data. But why can’t we have our cake
and eat it too? Why not both sophisticated models and deep features applied to lots of
data? Because inference over sophisticated models and extraction of deep features are
often computationally intensive, they don’t scale well.
Consider a simple task such as determining the correct usage of easily confusable
words such as “than” and “then” in English. One can view this as a supervised machine
learning problem: we can train a classifier to disambiguate between the options, and
then apply the classifier to new instances of the problem (say, as part of a grammar
checker). Training data is fairly easy to come by—we can just gather a large corpus of
texts and assume that most writers make correct choices (the training data may be noisy,
since people make mistakes, but no matter). In 2001, Banko and Brill [14] published
what has become a classic paper in natural language processing exploring the effects
of training data size on classification accuracy, using this task as the specific example.
They explored several classification algorithms (the exact ones aren’t important, as we
shall see), and not surprisingly, found that more data led to better accuracy. Across
many different algorithms, the increase in accuracy was approximately linear in the
log of the size of the training data. Furthermore, with increasing amounts of training
data, the accuracy of different algorithms converged, such that pronounced differences
in effectiveness observed on smaller datasets basically disappeared at scale. This led to
a somewhat controversial conclusion (at least at the time): machine learning algorithms
really don’t matter; all that matters is the amount of data you have. This resulted in
an even more controversial recommendation, delivered somewhat tongue-in-cheek: we
should just give up working on algorithms and simply spend our time gathering data
(while waiting for computers to become faster so we can process the data).
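The setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`extract_instances`, `train`, `predict`), the two context features, and the vote-counting classifier are all hypothetical choices made here, not the algorithms Banko and Brill evaluated. The point it shows is that labeled training data falls out of raw text for free, because the word each writer chose serves as the (possibly noisy) label.

```python
import re
from collections import Counter, defaultdict

CONFUSION_SET = {"than", "then"}  # the two confusable words from the text


def extract_instances(text):
    """Yield (label, features) pairs, one per occurrence of a confusable word.

    The word the writer actually used is taken as the (possibly noisy) label;
    the features are the immediately preceding and following words.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    for i, token in enumerate(tokens):
        if token in CONFUSION_SET:
            prev_word = tokens[i - 1] if i > 0 else "<s>"
            next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
            yield token, (("prev", prev_word), ("next", next_word))


def train(corpus):
    """Count how often each context feature co-occurs with each label."""
    counts = defaultdict(Counter)
    for label, features in extract_instances(corpus):
        for feature in features:
            counts[feature][label] += 1
    return counts


def predict(counts, prev_word, next_word):
    """Pick the label that receives the most votes from the two features."""
    votes = Counter()
    for feature in (("prev", prev_word), ("next", next_word)):
        votes.update(counts.get(feature, {}))
    if not votes:
        return None  # unseen context: this sketch declines to guess
    return votes.most_common(1)[0][0]


# Toy corpus, invented for illustration only.
corpus = (
    "She is taller than him. We ate and then we left. "
    "It is better than nothing. He paused, then he spoke."
)
model = train(corpus)
print(predict(model, "better", "him"))
print(predict(model, "and", "he"))
```

With a corpus this small the model knows almost nothing; the argument in the text is that simply feeding far more text into `train` is what improves accuracy, regardless of how simple the classifier is.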
As another example, consider the problem of answering short, fact-based questions
such as “Who shot Abraham Lincoln?” Instead of returning a list of documents that the
user would then have to sort through, a question answering (QA) system would directly
return the answer: John Wilkes Booth. This problem gained interest in the late 1990s,
when natural language processing researchers approached the challenge with sophisticated
linguistic processing techniques such as syntactic and semantic analysis. Around
2001, researchers discovered a far simpler approach to answering such questions based
on pattern matching [27, 53, 92]. Suppose you wanted the answer to the above question.
As it turns out, you can simply search for the phrase “shot Abraham Lincoln” on the
web and look for what appears to its left. Or better yet, look through multiple instances
of this phrase and tally up the words that appear to the left. This simple strategy works
surprisingly well, and has become known as the redundancy-based approach to question
answering. It capitalizes on the insight that in a very large text collection (i.e., the
web), answers to commonly-asked questions will be stated in obvious ways, such that
pattern-matching techniques suffice to extract answers accurately.
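A minimal sketch of the tallying step, assuming the search results are already available as a list of text snippets. The function name `tally_left_context`, the `window` parameter, and the three snippets are hypothetical and invented for illustration; they are not taken from the cited systems [27, 53, 92].

```python
import re
from collections import Counter


def tally_left_context(snippets, phrase, window=3):
    """Count word sequences (1 to `window` words) found directly left of `phrase`."""
    tally = Counter()
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            # Words that precede this occurrence of the phrase.
            left_words = re.findall(r"[A-Za-z]+", snippet[:match.start()])
            for n in range(1, window + 1):
                if len(left_words) >= n:
                    tally[" ".join(left_words[-n:])] += 1
    return tally


# Toy snippets standing in for web search results.
snippets = [
    "John Wilkes Booth shot Abraham Lincoln at Ford's Theatre.",
    "In 1865, John Wilkes Booth shot Abraham Lincoln.",
    "The actor Booth shot Abraham Lincoln in Washington.",
]
tally = tally_left_context(snippets, "shot Abraham Lincoln")
for candidate, count in tally.most_common(3):
    print(count, candidate)
```

Shorter candidates always score at least as high as the longer ones that contain them, so a real system would need some way to prefer the longest well-supported candidate; this sketch leaves that step out.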

 
