Mining of Massive Datasets PDF Download - Java Knowledge Sharing Network (Free Java Resource Downloads)

Compiled by this site for download:

Link: https://pan.baidu.com/s/1oLa-CJkNEcAJ3_xVKCDEXQ

Extraction code: 1rbt

Main content:

Data Mining

In this introductory chapter we begin with the essence of data mining and a discussion of how data mining is treated by the various disciplines that contribute

to this field. We cover “Bonferroni’s Principle,” which is really a warning about

overusing the ability to mine data. This chapter is also the place where we

summarize a few useful ideas that are not data mining but are useful in understanding some important data-mining concepts. These include the TF.IDF

measure of word importance, behavior of hash functions and indexes, and identities involving e, the base of natural logarithms. Finally, we give an outline of

the topics covered in the balance of the book.

1.1 What is Data Mining?

The most commonly accepted definition of “data mining” is the discovery of

“models” for data. A “model,” however, can be one of several things. We

mention below the most important directions in modeling.

1.1.1 Statistical Modeling

Statisticians were the first to use the term “data mining.” Originally, “data

mining” or “data dredging” was a derogatory term referring to attempts to

extract information that was not supported by the data. Section 1.2 illustrates

the sort of errors one can make by trying to extract what really isn’t in the data.

Today, “data mining” has taken on a positive meaning. Now, statisticians view

data mining as the construction of a statistical model, that is, an underlying

distribution from which the visible data is drawn.

Example 1.1 : Suppose our data is a set of numbers. This data is much

simpler than data that would be data-mined, but it will serve as an example. A

statistician might decide that the data comes from a Gaussian distribution and

use a formula to compute the most likely parameters of this Gaussian. The mean

and standard deviation of this Gaussian distribution completely characterize the

distribution and would become the model of the data. ✷
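As a minimal sketch of Example 1.1, the maximum-likelihood parameters of a Gaussian can be computed directly from the numbers. The function name `fit_gaussian` and the sample values are illustrative assumptions, not taken from the text.

```python
import math

def fit_gaussian(data):
    """Return the (mean, standard deviation) that maximize the
    Gaussian likelihood of the given numbers."""
    n = len(data)
    mean = sum(data) / n
    # The maximum-likelihood variance divides by n, not n - 1.
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, math.sqrt(variance)

# Hypothetical data set, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_gaussian(sample)
print(f"model: mean={mu}, std={sigma}")
```

Under this view the pair `(mu, sigma)` is the entire model: everything else about the data is discarded.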

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning.

There is no question that some data mining appropriately uses algorithms from

machine learning. Machine-learning practitioners use the data as a training set,

to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden

Markov models, and many others.

There are situations where using data in this way makes sense. The typical

case where machine learning is a good approach is when we have little idea of

what we are looking for in the data. For example, it is rather unclear what

it is about movies that makes certain movie-goers like or dislike them. Thus,

in answering the “Netflix challenge” to devise an algorithm that predicts the

ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple

form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations

where we can describe the goals of the mining more directly. An interesting

case in point is the attempt by WhizBang! Labs¹ to use machine learning to

locate people’s resumes on the Web. It was not able to do better than algorithms

designed by hand to look for some of the obvious words and phrases that appear

in the typical resume. Since everyone who has looked at or written a resume has

a pretty good idea of what resumes contain, there was no mystery about what

makes a Web page a resume. Thus, there was no advantage to machine-learning

over the direct design of an algorithm to discover resumes.

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic

problem. In this case, the model of the data is simply the answer to a complex

query about it. For instance, given the set of numbers of Example 1.1, we might

compute their average and standard deviation. Note that these values might

not be the parameters of the Gaussian that best fits the data, although they

will almost certainly be very close if the size of the data is large.
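One way such a gap can arise, offered here as an assumption about what the remark refers to, is that a query may report the sample standard deviation (dividing by n - 1) while the best-fitting Gaussian uses the maximum-likelihood value (dividing by n). The sketch below compares the two on a small and a large data set; the function name `summarize` and the data are hypothetical.

```python
import statistics

def summarize(numbers):
    """Return (average, sample std, maximum-likelihood std)."""
    avg = statistics.fmean(numbers)
    sample_std = statistics.stdev(numbers)   # divides by n - 1
    mle_std = statistics.pstdev(numbers)     # divides by n
    return avg, sample_std, mle_std

# Hypothetical data: the same four values, once and repeated 2500 times.
small = [1.0, 2.0, 3.0, 4.0]
large = small * 2500

_, s_small, m_small = summarize(small)
_, s_large, m_large = summarize(large)

gap_small = s_small - m_small
gap_large = s_large - m_large
print(f"gap with n={len(small)}: {gap_small:.6f}")
print(f"gap with n={len(large)}: {gap_large:.6f}")
```

The two estimates differ noticeably for four values and become nearly indistinguishable for ten thousand.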

There are many different approaches to modeling data. We have already

mentioned the possibility of constructing a statistical process whereby the data

could have been generated. Most other approaches to modeling can be described

as either

1. Summarizing the data succinctly and approximately, or

2. Extracting the most prominent features of the data and ignoring the rest.

We shall explore these two approaches in the following sections.

¹ This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.

1.1.4 Summarization

One of the most interesting forms of summarization is the PageRank idea, which

made Google successful and which we shall cover in Chapter 5. In this form

of Web mining, the entire complex structure of the Web is summarized by a

single number for each page. This number, the “PageRank” of the page, is

(oversimplifying somewhat) the probability that a random walker on the graph

would be at that page at any given time. The remarkable property this ranking

has is that it reflects very well the “importance” of the page – the degree to

which typical searchers would like that page returned as an answer to their

search query.
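The random-walker description can be sketched as repeated redistribution of probability along links. This is a simplified illustration only: it assumes every page has at least one outgoing link and that the graph is strongly connected, and it omits the refinements a practical PageRank computation needs. The three-page graph `toy_web` and the function name are hypothetical.

```python
def random_walk_distribution(links, steps=100):
    """Approximate the long-run probability that a random walker,
    who follows a uniformly chosen outgoing link at each step,
    is found at each page."""
    pages = sorted(links)
    prob = {p: 1.0 / len(pages) for p in pages}  # start uniformly
    for _ in range(steps):
        nxt = {p: 0.0 for p in pages}
        for page, outlinks in links.items():
            # The walker at `page` moves to each target with equal chance.
            share = prob[page] / len(outlinks)
            for target in outlinks:
                nxt[target] += share
        prob = nxt
    return prob

# Hypothetical toy graph: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = random_walk_distribution(toy_web)
for page, score in ranks.items():
    print(page, round(score, 3))
```

For this toy graph the walker spends about 40% of the time at A, 20% at B, and 40% at C, so the whole link structure is reduced to one number per page.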

Another important form of summary – clustering – will be covered in Chapter 7. Here, data is viewed as points in a multidimensional space. Points

that are “close” in this space are assigned to the same cluster. The clusters

themselves are summarized, perhaps by giving the centroid of the cluster and

the average distance from the centroid of points in the cluster. These cluster

summaries become the summary of the entire data set.
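A cluster summary of the kind described (centroid plus average distance from the centroid) might be computed as follows. The sketch assumes points have already been assigned to clusters and that distance is Euclidean; the cluster labels and coordinates are hypothetical.

```python
import math

def summarize_cluster(points):
    """Return (centroid, average Euclidean distance to the centroid)."""
    n = len(points)
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / n for d in range(dims))
    avg_dist = sum(math.dist(p, centroid) for p in points) / n
    return centroid, avg_dist

# Hypothetical assignment of 2-D points to two clusters.
clusters = {
    "c1": [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)],
    "c2": [(10.0, 10.0), (10.0, 12.0)],
}

# The per-cluster summaries together stand in for the whole data set.
summary = {name: summarize_cluster(pts) for name, pts in clusters.items()}
for name, (centroid, avg_dist) in summary.items():
    print(name, centroid, round(avg_dist, 3))
```

Six points are replaced by two centroids and two spreads, which is the sense in which clustering summarizes the data.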

Example 1.2 : A famous instance of clustering to solve a problem took place

long ago in London, and it was done entirely without computers.² The physician

John Snow, dealing with a cholera outbreak, plotted the cases on a map of the

city. A small illustration suggesting the process is shown in Fig. 1.1.
