In this section, we look at the major steps involved in data preprocessing, namely, data cleaning, data integration, data reduction, and data transformation.

Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied. Furthermore, dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting the data to the function being modeled. Therefore, a useful preprocessing step is to run your data through some data cleaning routines. Section 3.2 discusses methods for data cleaning.

Getting back to your task at AllElectronics, suppose that you would like to include data from multiple sources in your analysis. This would involve integrating multiple databases, data cubes, or files (i.e., data integration). Yet some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. For example, the attribute for customer identification may be referred to as customer_id in one data store and cust_id in another. Naming inconsistencies may also occur for attribute values. For example, the same first name could be registered as "Bill" in one database, "William" in another, and "B." in a third. Furthermore, you suspect that some attributes may be inferred from others (e.g., annual revenue). Having a large amount of redundant data may slow down or confuse the knowledge discovery process. Clearly, in addition to data cleaning, steps must be taken to help avoid redundancies during data integration. Typically, data cleaning and data integration are performed as a preprocessing step when preparing data for a data warehouse. Additional data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.

"Hmmm," you wonder, as you consider your data even further. "The data set I have selected for analysis is HUGE, which is sure to slow down the mining process. Is there a way I can reduce the size of my data set without jeopardizing the data mining results?" Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Data reduction strategies include dimensionality reduction and numerosity reduction.

In dimensionality reduction, data encoding schemes are applied so as to obtain a reduced or "compressed" representation of the original data. Examples include data compression techniques (e.g., wavelet transforms and principal components analysis), attribute subset selection (e.g., removing irrelevant attributes), and attribute construction (e.g., where a small set of more useful attributes is derived from the original set).

In numerosity reduction, the data are replaced by alternative, smaller representations using parametric models (e.g., regression or log-linear models) or nonparametric models (e.g., histograms, clusters, data aggregation, or sampling).
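Each of these preprocessing steps can be prototyped in a few lines of code. As a minimal Python sketch of data cleaning (the toy table, the mean-fill strategy, and the z-score cutoff are illustrative assumptions, not part of the AllElectronics example), missing values can be filled in and outlier candidates flagged as follows:

    import numpy as np
    import pandas as pd

    # Toy customer table with missing and suspicious values (illustrative only).
    df = pd.DataFrame({
        "age": [23, np.nan, 45, 31, np.nan, 52],
        "annual_salary": [48000, 61000, np.nan, 52000, 58000, 250000],
    })

    # Fill in missing values with the attribute mean -- one common strategy.
    df["age"] = df["age"].fillna(df["age"].mean())
    df["annual_salary"] = df["annual_salary"].fillna(df["annual_salary"].mean())

    # Flag outlier candidates by z-score. A cutoff like |z| > 3 is typical on
    # real data; 1.5 is used here only because this sample is tiny.
    z = (df["annual_salary"] - df["annual_salary"].mean()) / df["annual_salary"].std()
    print(df[z.abs() > 1.5])    # flags the 250,000 salary for inspection

Filling with the attribute mean is only one strategy among several; Section 3.2 discusses others.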
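Data integration can be sketched in the same spirit. Assuming two hypothetical data stores that name the customer identifier differently (customer_id vs. cust_id, as above), a schema-level reconciliation followed by a merge might look like:

    import pandas as pd

    # Two hypothetical data stores that name the same concept differently.
    store_a = pd.DataFrame({"customer_id": [1, 2],
                            "first_name": ["William", "Ada"]})
    store_b = pd.DataFrame({"cust_id": [1, 3],
                            "first_name": ["Bill", "Carl"]})

    # Schema-level reconciliation: unify the attribute names first.
    store_b = store_b.rename(columns={"cust_id": "customer_id"})

    # An outer join keeps customers found in either source; the suffixes
    # expose value-level conflicts ("William" vs. "Bill") for resolution.
    merged = store_a.merge(store_b, on="customer_id", how="outer",
                           suffixes=("_a", "_b"))
    print(merged)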
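For dimensionality reduction, a minimal principal components sketch (the random data and the choice of two components are assumptions for illustration) can be built on a singular value decomposition:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))        # 100 tuples described by 5 attributes

    # Center the data; the rows of Vt from the SVD are the principal directions.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    k = 2                                # keep the two strongest components
    X_reduced = Xc @ Vt[:k].T            # "compressed" 100 x 2 representation
    print(X_reduced.shape)               # (100, 2)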
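For numerosity reduction, nonparametric alternatives such as sampling and histograms are equally short to sketch (the synthetic salary distribution below is an assumption):

    import numpy as np

    rng = np.random.default_rng(1)
    salaries = rng.lognormal(mean=11.0, sigma=0.4, size=1_000_000)

    # Simple random sample without replacement: keep 1% of the tuples.
    sample = rng.choice(salaries, size=10_000, replace=False)

    # Histogram: summarize a million values with 20 bucket counts.
    counts, edges = np.histogram(salaries, bins=20)

    # The reduced representations approximate the full data closely.
    print(salaries.mean(), sample.mean())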
Data reduction is the topic of Section 3.4.

Getting back to your data, you have decided, say, that you would like to use a distance-based mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or clustering.¹ Such methods provide better results if the data to be analyzed have been normalized, that is, scaled to a smaller range such as [0.0, 1.0]. Your customer data, for example, contain the attributes age and annual salary. The annual salary attribute usually takes much larger values than age. Therefore, if the attributes are left unnormalized, the distance measurements taken on annual salary will generally outweigh distance measurements taken on age. Discretization and concept hierarchy generation can also be useful, where raw data values for attributes are replaced by ranges or higher conceptual levels. For example, raw values for age may be replaced by higher-level concepts, such as youth, adult, or senior.

¹ Neural networks and nearest-neighbor classifiers are described in Chapter 9, and clustering is discussed in Chapters 10 and 11.

Discretization and concept hierarchy generation are powerful tools for data mining in that they allow data mining at multiple abstraction levels. Normalization, data discretization, and concept hierarchy generation are forms of data transformation. You soon realize such data transformation operations are additional data preprocessing procedures that would contribute toward the success of the mining process. Data integration and data discretization are discussed in later sections of this chapter.
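Returning to the age and annual salary example, a minimal min-max normalization sketch (the sample values are illustrative assumptions) shows how scaling to [0.0, 1.0] keeps either attribute from dominating a distance computation:

    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        """Scale values to [new_min, new_max] by min-max normalization."""
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
                for v in values]

    ages = [23, 31, 45, 52, 64]
    salaries = [48_000, 52_000, 58_000, 61_000, 250_000]

    # After scaling, both attributes lie in [0.0, 1.0], so annual salary
    # no longer swamps age in a Euclidean distance computation.
    print(min_max_normalize(ages))
    print(min_max_normalize(salaries))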
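Discretization to the youth/adult/senior hierarchy can likewise be sketched with a small mapping function (the age boundaries below are illustrative assumptions, not values from the text):

    def discretize_age(age):
        """Map a raw age to a higher-level concept (boundaries illustrative)."""
        if age < 25:
            return "youth"
        elif age < 60:
            return "adult"
        return "senior"

    print([discretize_age(a) for a in [23, 31, 45, 52, 64]])
    # -> ['youth', 'adult', 'adult', 'adult', 'senior']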
