
Data warehousing and data mining dr.p.rizwan ahmed, Study notes of Data Mining

Text Book - Text Book

Typology: Study notes

2014/2015

Uploaded on 10/02/2015

Dr.Rizwan.Ahmed
DATA WAREHOUSING
AND
DATA MINING
A Comprehensive guide for students and IT Professionals
(Choice Based Credit System (CBCS) Pattern) New Syllabus
(For B.Sc. Computer Science, B.Sc. Software Computer Science, B.Sc. ISM, B.Sc. IT,
B.Sc. Software System, B.Sc. Software Engineering, BCA, M.Sc. Computer Science,
M.Sc. Information Technology, M.Sc. Information System and Management, M.Sc.
Software Engineering, MCA, B.E.CSE, B.Tech IT, M.E CSE, M.Tech IT, M.Phil., and
IT Professionals.)
By
Dr. P. Rizwan Ahmed, MCA, M.Sc., M.A., M.Phil., Ph.D.
Head of the Department
Department of Computer Applications and
PG Department of Information Technology
Mazharul Uloom College,
Ambur - 635 802, Vellore Dist. Tamil Nadu.

CONTENTS

Preface

Acknowledgement

PART – I DATA MINING

Chapter 1 Introduction

1.1 An Expanding universe of data

1.2 Information and production factor

1.3 KDD and data mining

1.4 Data Mining vs query tools

1.5 Data Mining in Marketing

1.6 Practical applications of data mining

1.7 Learning

1.8 Self-learning computer systems

1.9 Machine learning

1.9.1 Why machine learning is done?

1.10 Machine learning and the methodology of science

1.10.1 Differences between Data Mining and Machine Learning

1.11 Concept Learning

Summary

Review Question

Chapter 2 Data Mining and the Data Warehouse

2.1 Data Warehouse: Definitions

2.2 Why do we need Data Warehouse?

2.3 Designing decision support systems

2.3.1 Hardware and software products of a decision support system

2.4 Integration with data mining

2.5 Client/server and data warehousing

2.6 Multi-processing machines

2.7 Cost justification

Summary

Review Questions

6.2 Information content of a message

6.3 Noise and redundancy

6.4 Significance of noise

6.5 Fuzzy databases

6.6 The traditional theory of the relational database

6.7 From relations to tables

6.7.1 From keys to statistical dependencies

6.8 Denormalization

6.9 Data mining primitives

Summary

Review Questions

Chapter 7 Data Mining

7.1 Introduction

7.2 Data

7.3 Information

7.4 Knowledge

7.5 Historical Note: Many names of Data Mining

7.6 Data Mining

7.6.1 Some of the definitions of Data Mining

7.7 Why Data Mining

7.8 Why Data Mining is Important?

7.9 Uses of Data Mining

7.10 Data Mining Models

7.10.1 Verification Model

7.10.2 Discovery Model

7.11 Development of data mining

7.12 Applications of Data Mining

7.12.1 Healthcare

7.12.2 Finance

7.12.3 Retail Industry

7.12.4 Telecommunication

7.12.5 Text Mining and Web Mining

7.12.6 Higher Education

7.13 Basic Data Mining Tasks / Taxonomy of data mining tasks

7.13.1 Prediction methods

7.13.2 Descriptive methods

7.14 Data Mining Vs Database

7.15 Data Mining Vs KDD

7.16 Steps in Data Mining Process / Steps involved in KDD

7.17 Architecture of a typical data mining system

7.18 Future Trends

7.18.1 Data Trends

7.18.2 Hardware Trends

7.18.3 Network Trends

7.18.4 Scientific Computing Trends

7.18.5 Business Trends

7.19 Major issues in Data Mining / Data Mining Issues

7.20 Data Mining Metrics

7.21 Social Implications of Data Mining

7.22 Data Mining from a database Perspective

Summary

Review Question

Chapter 8 Advanced Databases

8.1 Various kinds of data / Types of Data

8.1.1 Flat files

8.1.2 Relational Databases

8.1.3 Data Warehouses

8.1.4 Transaction Databases

8.1.5 Object oriented databases

8.1.6 Temporal Databases

8.1.7 Text and Multimedia Databases

8.1.8 Spatial Databases

8.1.9 Time-Series Databases

8.1.10 World Wide Web (WWW)

8.1.11 Heterogeneous databases

Summary

Review Question

Chapter 9 Data Mining Functionalities, Classification and Case Study

9.1 Data Mining Functionalities

9.2 Pattern Interestingness / Interestingness of Patterns

9.2.1 Interestingness measures

9.2.2 Objective vs. subjective interestingness measures

9.3 Classification of Data Mining Systems

11.6 Genetic Algorithms

Summary

Review Question

Chapter 12 Data Preprocessing

12.1 Introduction

12.2 Why preprocess the data / Need for preprocessing

12.3 Data Preprocessing Techniques / Major Tasks in Data Preprocessing

12.4 Data Cleaning

12.4.1 Missing Data / Values

12.4.1.1 Methods of handling missing data

12.4.2 Noisy Data

12.4.2.1 How to Handle Noisy Data?

12.4.3 Outlier Analysis

12.4.4 Regression

12.5 Data Cleaning as a Process

12.5.1 Discrepancy detection

12.5.2 Discrepancy Detection Tools

12.5.3 Data Transformation

12.5.4 Data Transformation Tools

12.6 Data Integration

12.6.1 Issues to be considered in Data Integration

12.6.1.1 Schema integration

12.6.1.2 Reduction

12.6.1.3 Detecting and resolving data value conflicts

12.6.2 Handling Redundant Data in Data Integration

12.7 Data Transformation

12.7.1 Methods of Data Normalization

12.7.1.1 Min-max normalization

12.7.1.2 z-score normalization

12.7.1.3 Normalization by decimal scaling

12.8 Data Reduction

12.8.1 Data Reduction Strategies

12.8.1.1 Data Cube Aggregation

12.8.1.2 Attribute Subset Selection

12.8.1.3 Dimensionality Reduction

12.8.1.4 Numerosity Reduction

12.8.1.5 Data Discretization and concept hierarchy generation

Data discretization

12.9 Data Mining Query Languages (DMQL)

Summary

Review Questions

Chapter 13 Association Rules

13.1 Association Rules

13.2 Large Item sets

13.3 Basic Algorithm

13.3.1 Apriori Algorithm

13.3.2 Partitioning

13.4 Parallel and Distributed Algorithms

13.4.1 Data parallelism

13.4.2 Task parallelism

13.5 Comparing Approaches

13.6 Incremental Rules

13.7 Advanced Association Rule Techniques

13.7.1 Generalized association rules

13.7.2 Multiple-level association rules

13.7.3 Quantitative association rules

13.7.4 Using Multiple Minimum Supports

13.8 Measuring the Quality of Rules

Summary

Review Questions

Chapter 14 Concept Description: Generalization and Characterization

14.1 Concept Description

14.2 Data Generalization and Summarization-based

14.2.1 Data Generalization

14.2.2 Characterization: Data Cube Approach

14.2.3 Attribute oriented induction for data characterization

16.1.1 Classification algorithms based on the categorization: Issues in Classification

16.2 Statistical-Based Algorithms

16.2.1 Regression

16.2.2 Bayesian classification

16.2.3 Naïve Bayes Classifier

16.3 Distance-Based Algorithms

16.3.1 Simple Approach

16.3.2 K Nearest Neighbors

16.4 Decision Tree-Based Algorithms

16.4.1 C4.

16.4.2 CART

16.4.2.1 Scalable DT techniques

16.5 Neural Network-Based Algorithms

16.5.1 Propagation

16.5.2 NN supervised learning

16.5.3 Radial Basis Function Networks

16.5.4 Perceptron

16.6 Rule-Based Algorithms

16.6.1 Generating Rules from a DT

16.6.2 Generating Rules from a Neural Net

16.6.3 Generating Rules without a DT or NN

16.7 Combining Techniques

Summary

Review Questions

Chapter 17 Classification and Prediction

17.1 Classification

17.1.1 Classification: A Two-Step Process

17.1.2 Prediction

17.1.3 Issues regarding classification and prediction

17.1.4 Comparing Classification and Prediction Methods

17.2 Classification by decision tree induction

17.2.1 Decision Tree Induction

17.2.2 Attribute Selection Measure

17.2.3 Information Gain (ID3/C4.5)

17.2.4 Gini Index (IBM IntelligentMiner)

17.2.5 Extracting Classification Rules from Trees

17.2.6 Avoid Overfitting in Classification

17.2.7 Enhancements to basic decision tree induction

17.2.8 Classification in Large Databases

17.3 Bayesian Classification: Introduction

17.3.1 Bayesian Classification: Why?

17.3.2 Bayesian Classification

17.3.3 Bayesian Theorem

17.3.4 Naïve Bayes Classifier

17.3.5 Bayesian Belief Networks

17.3.6 Training Bayesian Belief Networks

17.4 Rule Based Classification

17.4.1 Using IF-THEN Rules for Classification

17.4.2 Rule Extraction from a Decision Tree

17.4.3 Rule induction using a Sequential Covering Algorithm

17.4.4 Rule Quality Measures

17.5 Classification by backpropagation

17.6 Classification based on concepts from association rule mining / Association-Based Classification / Classification by association Rules

17.7 Lazy Learners (or Learning from Your Neighbors)

17.7.1 k-Nearest Neighbor

17.7.2 Case-Based Reasoning (CBR)

17.8 Other Classification Methods

17.8.1 Genetic Algorithms

17.8.2 Rough Set Approach

17.8.3 Fuzzy Sets Approaches

17.9 Prediction

17.10 Classification accuracy

17.10.1 Classification Accuracy: Estimating Error Rates

Summary

Review Questions

Chapter 18 Clustering

18.1 Introduction

18.2 Similarity and Distance Measures

18.3 Outliers

18.4 Hierarchical Algorithms

18.4.1 Agglomerative Algorithms

18.5 Partitional Algorithms

18.5.1 Minimum spanning tree

18.5.2 Squared Error Clustering Algorithm

19.9.3 CURE

19.9.4 ROCK

19.9.5 CHAMELEON

19.10 Density-Based Methods

19.10.1 DBSCAN

19.10.2 OPTICS

19.10.3 DENCLUE

19.11 Grid-Based Methods

19.11.1 STING

19.11.2 WaveCluster

19.11.3 CLIQUE

19.12 Model-Based Clustering Methods

19.12.1 Expectation-Maximization (EM)

19.12.2 Conceptual clustering

19.12.3 Neural network approaches

19.13 Outlier Analysis

19.13.1 Outlier Discovery: Statistical Approaches

19.13.2 Outlier Discovery: Distance-Based Approach

19.13.3 Outlier Discovery: Deviation-Based Approach

Summary

Review Questions

Chapter 20 Advanced Topics (Mining Complex types of data)

20.1 Multidimensional analysis and descriptive mining of complex data objects

20.1.1 Generalization of Structured Data

20.1.2 Generalizing Spatial and Multimedia Data

20.1.3 Generalizing Object Data

20.1.4 Generalization-based Mining of Plan Databases by Divide and Conquer

20.2 Spatial Data Mining

20.2.1 Dimensions and Measures in Spatial Data Warehouse

20.2.2 Mining Spatial Association and Co-location Patterns

20.2.3 Spatial Classification and Spatial Trend Analysis

20.3 Mining multimedia databases

20.3.1 Similarity Search in Multimedia Data

20.3.2 Multidimensional Analysis of Multimedia Data

20.4 Mining time-series and sequence data

20.4.1 Time-series database

20.4.2 Mining Time-Series and Sequence Data: Trend analysis

20.4.3 Estimation of Trend Curve

20.4.4 Discovery of Trend in Time-Series

20.4.5 Multidimensional Indexing

20.4.6 Subsequence Matching

20.4.7 Query Languages for Time Sequences

20.5 Text Mining / Mining text databases

20.5.1 Text Data Analysis and Information Retrieval

20.5.2 Text Indexing Techniques

20.5.3 Text Mining Approaches

20.6 Mining the World-Wide Web / Web Mining

Chapter 21 Applications and Trends in Data Mining

21.1 Applications of Data Mining

21.1.1 Data Mining for Financial Data Analysis

21.1.2 Data Mining for Retail Industry

21.1.3 Data Mining for Telecommunication Industry

21.1.4 Biomedical Data Mining and DNA Analysis

21.1.5 Data Mining Applications in Sales/Marketing

21.1.6 Data Mining Applications in Banking / Finance

21.1.7 Data Mining Applications in Health Care and Insurance

21.2 Data mining system products and research prototypes

21.2.1 How to choose a data mining system?

21.2.2 Examples of Data Mining Systems

21.3 Additional themes on data mining

21.3.1 Theoretical Foundations of Data Mining

21.3.2 Statistical Data Mining

21.4 Social impact of data mining

21.5 Trends in data mining

Summary

Review Questions

PART – II DATA WAREHOUSING

Chapter 22 Data warehousing

22.1 Introduction

22.2 Characteristics of Data Warehouse

Chapter 25 Data Warehouse Architecture

25.1 Data Warehouse architecture

25.1.1 Steps for the design and construction of data warehouse

25.1.2 Data Warehouse Design Process

25.1.3 Three-Tier Data Warehouse Architecture

25.1.3.1 Enterprise Warehouse

25.1.3.2 Data Mart

25.1.3.3 Virtual data warehouse

25.2 Data warehouse Back-End Tools and Utilities

25.3 Metadata Repository

25.4 OLAP Engine

25.4.1 Relational OLAP (ROLAP)

25.4.2 Multidimensional OLAP (MOLAP)

25.4.3 Hybrid OLAP (HOLAP)

25.4.4 Specialized Servers

Summary

Review Questions

Chapter 26 Data Warehouse Implementation

26.1 Data Warehouse Implementation

26.1.1 Efficient Computation of Data Cubes

26.1.2 Cube Operation

26.1.3 Indexing OLAP Data: Bitmap Index

26.1.4 Indexing OLAP Data: Join Indices

26.1.5 Efficient Processing OLAP Queries

Summary

Review Questions

Chapter 27 Mapping the data warehouse to a multiprocessor architecture

27.1 Relational database technology for data warehouse

27.1.1 Types of parallelism

27.1.2 Data partitioning

27.2 Database architecture for parallel processing

27.2.1 Shared-memory architecture

27.2.2 Shared-disk architecture

27.2.3 Shared-nothing architecture

27.2.4 Combined architecture

27.3 Parallel RDBMS features

27.4 Alternative technologies

27.5 Parallel DBMS Vendors

27.5.1 Oracle

27.5.2 Informix

27.5.4 Sybase

27.5.5 Microsoft

Summary

Review Questions

Chapter 28 Reporting and Query Tools and Applications

28.1 Tool categories

28.1.1 Reporting tools

28.1.2 Managed Query Tools

28.1.3 Executive information tools

28.1.4 OLAP tools

28.1.5 Data mining tools

28.2 Need for application

28.3 Cognos Impromptu

28.4 Applications

28.4.1 PowerBuilder

Summary

Review Questions

Chapter 29 On-Line Analytical Processing (OLAP)

29.1 Introduction

29.2 Need for OLAP

29.3 Multidimensional data model

29.3.1 From Tables and Spreadsheets to Data Cubes

29.4 OLAP Guidelines / OLAP Product Evaluation Rules

29.5 Data Warehouse Schema / OLAP Schema

29.5.1 Star Schema

29.5.2 Star Schema Keys

29.5.3 Advantages of Star schema

29.5.4 Snow Flake Schema

Summary

Review Questions

Chapter 32 Operating the data warehouse

32.1 Introduction

32.2 Day-to-Day Operations of the Data Warehouse

32.3 Overnight Processing

Summary

Review Questions

Chapter 33 Capacity Planning

33.1 Process

33.2 Estimating the Load

33.2.1 Initial Configuration

33.2.2 How much CPU bandwidth

33.2.3 How Much Memory

33.2.4 How much disk?

Summary

Review Questions

Chapter 34 Tuning and testing the data warehouse

34.1 Tuning the Data Load

34.2 Prioritized Tuning Steps

34.3 Tuning Queries

34.3.1 Fixed queries

34.3.2 Ad hoc queries

34.4 Testing the Data Warehouse

34.4.1 Introduction

34.4.2 The Testing Terminologies

34.4.3 Testing the operational environment

34.4.5 Testing the database

34.4.5.1 Testing database manager and monitoring tools

34.4.5.2 Testing database features

34.4.5.3 Testing database performance

34.5 Testing the Application

Summary

Review Questions

Chapter 35 Backup and Recovery

35.1 Introduction

35.1.1 Types of Backup

35.2 Data Warehouse Recovery Models

35.3 Define Backup and Recovery Strategy

35.4 Security Impact on Design of Data Warehouse

35.4.1 Application Development

35.4.2 Database Design

35.4.3 Testing

35.5 Disaster Recovery

Summary

Review Questions

APPENDIX A: Glossary

APPENDIX B: Two-Mark Questions with Answers

APPENDIX C: Past University Question Papers

BIBLIOGRAPHY