Data-Intensive Text Processing with MapReduce: PDF Download
Download compiled by this site:
Main content:
CHAPTER 1. INTRODUCTION
everything from understanding political discourse in the blogosphere to predicting the
movement of stock prices.
There is a growing body of evidence, at least in text processing, that of the three
components discussed above (data, features, algorithms), data probably matters the
most. Superficial word-level features coupled with simple models in most cases trump
sophisticated models over deeper features and less data. But why can’t we have our cake
and eat it too? Why not both sophisticated models and deep features applied to lots of
data? Because inference over sophisticated models and extraction of deep features are
often computationally intensive, they don’t scale well.
Consider a simple task such as determining the correct usage of easily confusable
words such as “than” and “then” in English. One can view this as a supervised machine
learning problem: we can train a classifier to disambiguate between the options, and
then apply the classifier to new instances of the problem (say, as part of a grammar
checker). Training data is fairly easy to come by—we can just gather a large corpus of
texts and assume that most writers make correct choices (the training data may be noisy,
since people make mistakes, but no matter). In 2001, Banko and Brill [14] published
what has become a classic paper in natural language processing exploring the effects
of training data size on classification accuracy, using this task as the specific example.
They explored several classification algorithms (the exact ones aren’t important, as we
shall see), and not surprisingly, found that more data led to better accuracy. Across
many different algorithms, the increase in accuracy was approximately linear in the
log of the size of the training data. Furthermore, with increasing amounts of training
data, the accuracy of different algorithms converged, such that pronounced differences
in effectiveness observed on smaller datasets basically disappeared at scale. This led to
a somewhat controversial conclusion (at least at the time): machine learning algorithms
really don’t matter, all that matters is the amount of data you have. This resulted in
an even more controversial recommendation, delivered somewhat tongue-in-cheek: we
should just give up working on algorithms and simply spend our time gathering data
(while waiting for computers to become faster so we can process the data).
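To make the setup concrete, here is a minimal sketch of the "than"/"then" task framed as supervised classification over surrounding-word features. This is not the Banko and Brill system: the scikit-learn pipeline, the two-word context window, and the corpus handling are all assumptions made for illustration. Training on progressively larger slices of the examples gives a feel for the trend described above, where accuracy tends to grow roughly with the log of the training-set size.

```python
# A minimal sketch (not the authors' code) of "than"/"then" disambiguation
# as supervised classification over context-word features.
# Assumes scikit-learn is available; corpus handling is deliberately toy-sized.
import re
import random
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def extract_examples(text, window=2):
    """Turn each occurrence of 'than'/'then' into (context features, label)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    examples = []
    for i, tok in enumerate(tokens):
        if tok in ("than", "then"):
            feats = {}
            for offset in range(-window, window + 1):
                if offset == 0:
                    continue  # the target word itself is the label, not a feature
                j = i + offset
                word = tokens[j] if 0 <= j < len(tokens) else "<pad>"
                feats[f"w{offset}={word}"] = 1
            examples.append((feats, tok))
    return examples

def train_and_eval(examples, train_fraction=0.8):
    """Train on a random split of the data and report held-out accuracy."""
    random.shuffle(examples)
    split = int(len(examples) * train_fraction)
    train, test = examples[:split], examples[split:]
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit([f for f, _ in train], [y for _, y in train])
    return model.score([f for f, _ in test], [y for _, y in test])

# Usage: gather a large body of text, assume its writers chose correctly, call
# extract_examples() on it, and run train_and_eval() on increasingly larger
# slices to observe how accuracy improves as the training data grows.
```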
As another example, consider the problem of answering short, fact-based questions
such as “Who shot Abraham Lincoln?” Instead of returning a list of documents that the
user would then have to sort through, a question answering (QA) system would directly
return the answer: John Wilkes Booth. This problem gained interest in the late 1990s,
when natural language processing researchers approached the challenge with sophisticated linguistic processing techniques such as syntactic and semantic analysis. Around
2001, researchers discovered a far simpler approach to answering such questions based
on pattern matching [27, 53, 92]. Suppose you wanted the answer to the above question.
As it turns out, you can simply search for the phrase “shot Abraham Lincoln” on the
web and look for what appears to its left. Or better yet, look through multiple instances
of this phrase and tally up the words that appear to the left. This simple strategy works
surprisingly well, and has become known as the redundancy-based approach to question
answering. It capitalizes on the insight that in a very large text collection (i.e., the
web), answers to commonly-asked questions will be stated in obvious ways, such that
pattern-matching techniques suffice to extract answers accurately.
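The tally described above fits in a few lines of code. The sketch below is only an illustration of the idea: the snippets are passed in directly rather than retrieved from a web search engine, and the fixed three-word window to the left of the pattern is an assumption made for the example.

```python
# A minimal sketch of the redundancy-based approach: given text snippets that
# contain the answer pattern, tally the words appearing just to its left and
# report the most frequent candidates. In practice the snippets would come
# from a web search; here they are supplied directly.
import re
from collections import Counter

def answer_by_redundancy(snippets, pattern="shot abraham lincoln", width=3):
    """Count the word sequences immediately preceding the pattern."""
    counts = Counter()
    pat = pattern.split()
    for snippet in snippets:
        tokens = re.findall(r"[a-z']+", snippet.lower())
        for i in range(len(tokens) - len(pat) + 1):
            if tokens[i:i + len(pat)] == pat:
                left = tokens[max(0, i - width):i]
                if left:
                    counts[" ".join(left)] += 1
    return counts.most_common(5)

# Example: across many snippets, "john wilkes booth" should dominate the tally.
snippets = [
    "John Wilkes Booth shot Abraham Lincoln at Ford's Theatre in 1865.",
    "Actor John Wilkes Booth shot Abraham Lincoln during a performance.",
    "It was John Wilkes Booth who shot Abraham Lincoln.",
]
print(answer_by_redundancy(snippets))
```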