Data-Intensive Text Processing with MapReduce (PDF download)
		CHAPTER 1. INTRODUCTION
		everything from understanding political discourse in the blogosphere to predicting the 
		movement of stock prices. 
		There is a growing body of evidence, at least in text processing, that of the three 
		components discussed above (data, features, algorithms), data probably matters the 
		most. Superficial word-level features coupled with simple models in most cases trump 
		sophisticated models over deeper features and less data. But why can’t we have our cake 
		and eat it too? Why not both sophisticated models and deep features applied to lots of 
		data? Because inference over sophisticated models and extraction of deep features are 
		often computationally intensive, they don’t scale well. 
		Consider a simple task such as determining the correct usage of easily confusable 
		words such as “than” and “then” in English. One can view this as a supervised machine 
		learning problem: we can train a classifier to disambiguate between the options, and 
		then apply the classifier to new instances of the problem (say, as part of a grammar 
		checker). Training data is fairly easy to come by—we can just gather a large corpus of 
		texts and assume that most writers make correct choices (the training data may be noisy, 
		since people make mistakes, but no matter). In 2001, Banko and Brill [14] published 
		what has become a classic paper in natural language processing exploring the effects 
		of training data size on classification accuracy, using this task as the specific example. 
		They explored several classification algorithms (the exact ones aren’t important, as we 
		shall see), and not surprisingly, found that more data led to better accuracy. Across 
		many different algorithms, the increase in accuracy was approximately linear in the 
		log of the size of the training data. Furthermore, with increasing amounts of training 
		data, the accuracy of different algorithms converged, such that pronounced differences 
		in effectiveness observed on smaller datasets basically disappeared at scale. This led to 
		a somewhat controversial conclusion (at least at the time): machine learning algorithms 
		really don’t matter, all that matters is the amount of data you have. This resulted in 
		an even more controversial recommendation, delivered somewhat tongue-in-cheek: we 
		should just give up working on algorithms and simply spend our time gathering data 
		(while waiting for computers to become faster so we can process the data). 
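The classifier described above can be sketched in a few lines. This is a minimal, hypothetical illustration (the toy corpus and the particular context features are assumptions, not the setup Banko and Brill used): it counts how often the words immediately surrounding "than" or "then" co-occur with each choice in a corpus assumed to be mostly correct, then votes on new instances.

```python
from collections import Counter, defaultdict

# Toy stand-in corpus; the approach assumes most writers choose correctly,
# so a large raw text collection serves as (noisy) labeled training data.
corpus = [
    "better late than never",
    "faster than light",
    "stronger than steel",
    "first this and then that",
    "we left and then we ate",
    "wait until then",
]

def features(words, i):
    """Context features: the words immediately before and after position i."""
    prev = words[i - 1] if i > 0 else "<s>"
    nxt = words[i + 1] if i + 1 < len(words) else "</s>"
    return [("prev", prev), ("next", nxt)]

# Count feature/label co-occurrences over the (assumed-correct) corpus.
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        if w in ("than", "then"):
            for f in features(words, i):
                counts[f][w] += 1

def disambiguate(sentence, i):
    """Pick 'than' or 'then' for the confusable slot at word index i."""
    words = sentence.split()
    score = Counter()
    for f in features(words, i):
        score.update(counts[f])  # each seen context casts weighted votes
    return score.most_common(1)[0][0] if score else "than"

print(disambiguate("it is faster ___ light", 3))  # -> than
print(disambiguate("we left and ___ we ate", 3))  # -> then
```

The point the paper makes is visible even here: the model is trivial, and accuracy comes almost entirely from how much context the counts have seen.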
		As another example, consider the problem of answering short, fact-based questions 
		such as “Who shot Abraham Lincoln?” Instead of returning a list of documents that the 
		user would then have to sort through, a question answering (QA) system would directly 
		return the answer: John Wilkes Booth. This problem gained interest in the late 1990s, 
		when natural language processing researchers approached the challenge with sophisticated linguistic processing techniques such as syntactic and semantic analysis. Around 
		2001, researchers discovered a far simpler approach to answering such questions based 
		on pattern matching [27, 53, 92]. Suppose you wanted the answer to the above question. 
		As it turns out, you can simply search for the phrase “shot Abraham Lincoln” on the 
		web and look for what appears to its left. Or better yet, look through multiple instances 
		of this phrase and tally up the words that appear to the left. This simple strategy works 
		surprisingly well, and has become known as the redundancy-based approach to question 
		answering. It capitalizes on the insight that in a very large text collection (i.e., the 
		web), answers to commonly-asked questions will be stated in obvious ways, such that 
		pattern-matching techniques suffice to extract answers accurately.
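The redundancy-based strategy can be sketched directly. In this hypothetical example the snippets are hard-coded stand-ins for web search results (a real system would retrieve them from a search engine); the code searches each snippet for the phrase, collects the few words to its left, and tallies them:

```python
import re
from collections import Counter

# Stand-ins for search-engine result snippets for "shot Abraham Lincoln".
snippets = [
    "It is well known that John Wilkes Booth shot Abraham Lincoln in 1865.",
    "Booth shot Abraham Lincoln at Ford's Theatre.",
    "The actor John Wilkes Booth shot Abraham Lincoln during a play.",
]

def answer_by_redundancy(snippets, pattern="shot Abraham Lincoln", width=3):
    """Tally the words appearing just left of the pattern across snippets."""
    votes = Counter()
    for text in snippets:
        for m in re.finditer(re.escape(pattern), text):
            left_words = text[:m.start()].split()[-width:]
            votes.update(left_words)
    return votes

votes = answer_by_redundancy(snippets)
print(votes.most_common(3))  # 'Booth' wins with 3 votes
```

The tally concentrates on the answer precisely because, in a large enough collection, someone has stated it in the obvious word order; no parsing or semantic analysis is needed.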