Data Mining and Predictive Analytics
Hardcover, English, 2015
By Daniel T. Larose (Central Connecticut State University, USA)
2 079 kr
Product information
- Publication date: 2015-04-24
- Dimensions: 163 x 243 x 48 mm
- Weight: 1 207 g
- Format: Hardcover
- Language: English
- Series: Wiley Series on Methods and Applications in Data Mining
- Pages: 824
- Edition: 2
- Publisher: John Wiley & Sons Inc
- ISBN: 9781118116197
Daniel T. Larose is Professor of Mathematical Sciences and Director of the Data Mining programs at Central Connecticut State University. He has published several books, including Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage (Wiley, 2007) and Discovering Knowledge in Data: An Introduction to Data Mining (Wiley, 2005). In addition to his scholarly work, Dr. Larose is a consultant in data mining and statistical analysis, working with many high-profile clients, including Microsoft, Forbes Magazine, the CIT Group, KPMG International, Computer Associates, and Deloitte, Inc.

Chantal D. Larose is an Assistant Professor of Statistics & Data Science at Eastern Connecticut State University (ECSU). She has co-authored three books on data science and predictive analytics. She helped develop data science programs at ECSU and at SUNY New Paltz. She received her PhD in Statistics from the University of Connecticut, Storrs in 2015 (dissertation title: Model-based Clustering of Incomplete Data).
Table of Contents

PREFACE
ACKNOWLEDGMENTS

PART I DATA PREPARATION

CHAPTER 1 AN INTRODUCTION TO DATA MINING AND PREDICTIVE ANALYTICS
1.1 What is Data Mining? What is Predictive Analytics?
1.2 Wanted: Data Miners
1.3 The Need for Human Direction of Data Mining
1.4 The Cross-Industry Standard Process for Data Mining: CRISP-DM
1.4.1 CRISP-DM: The Six Phases
1.5 Fallacies of Data Mining
1.6 What Tasks Can Data Mining Accomplish?

CHAPTER 2 DATA PREPROCESSING
2.1 Why do We Need to Preprocess the Data?
2.2 Data Cleaning
2.3 Handling Missing Data
2.4 Identifying Misclassifications
2.5 Graphical Methods for Identifying Outliers
2.6 Measures of Center and Spread
2.7 Data Transformation
2.8 Min–Max Normalization
2.9 Z-Score Standardization
2.10 Decimal Scaling
2.11 Transformations to Achieve Normality
2.12 Numerical Methods for Identifying Outliers
2.13 Flag Variables
2.14 Transforming Categorical Variables into Numerical Variables
2.15 Binning Numerical Variables
2.16 Reclassifying Categorical Variables
2.17 Adding an Index Field
2.18 Removing Variables that are not Useful
2.19 Variables that Should Probably not be Removed
2.20 Removal of Duplicate Records
2.21 A Word About ID Fields

CHAPTER 3 EXPLORATORY DATA ANALYSIS
3.1 Hypothesis Testing Versus Exploratory Data Analysis
3.2 Getting to Know the Data Set
3.3 Exploring Categorical Variables
3.4 Exploring Numeric Variables
3.5 Exploring Multivariate Relationships
3.6 Selecting Interesting Subsets of the Data for Further Investigation
3.7 Using EDA to Uncover Anomalous Fields
3.8 Binning Based on Predictive Value
3.9 Deriving New Variables: Flag Variables
3.10 Deriving New Variables: Numerical Variables
3.11 Using EDA to Investigate Correlated Predictor Variables
3.12 Summary of Our EDA

CHAPTER 4 DIMENSION-REDUCTION METHODS
4.1 Need for Dimension-Reduction in Data Mining
4.2 Principal Components Analysis
4.3 Applying PCA to the Houses Data Set
4.4 How Many Components Should We Extract?
4.5 Profiling the Principal Components
4.6 Communalities
4.7 Validation of the Principal Components
4.8 Factor Analysis
4.9 Applying Factor Analysis to the Adult Data Set
4.10 Factor Rotation
4.11 User-Defined Composites
4.12 An Example of a User-Defined Composite

PART II STATISTICAL ANALYSIS

CHAPTER 5 UNIVARIATE STATISTICAL ANALYSIS
5.1 Data Mining Tasks in Discovering Knowledge in Data
5.2 Statistical Approaches to Estimation and Prediction
5.3 Statistical Inference
5.4 How Confident are We in Our Estimates?
5.5 Confidence Interval Estimation of the Mean
5.6 How to Reduce the Margin of Error
5.7 Confidence Interval Estimation of the Proportion
5.8 Hypothesis Testing for the Mean
5.9 Assessing the Strength of Evidence Against the Null Hypothesis
5.10 Using Confidence Intervals to Perform Hypothesis Tests
5.11 Hypothesis Testing for the Proportion

CHAPTER 6 MULTIVARIATE STATISTICS
6.1 Two-Sample t-Test for Difference in Means
6.2 Two-Sample Z-Test for Difference in Proportions
6.3 Test for the Homogeneity of Proportions
6.4 Chi-Square Test for Goodness of Fit of Multinomial Data
6.5 Analysis of Variance

CHAPTER 7 PREPARING TO MODEL THE DATA
7.1 Supervised Versus Unsupervised Methods
7.2 Statistical Methodology and Data Mining Methodology
7.3 Cross-Validation
7.4 Overfitting
7.5 Bias–Variance Trade-Off
7.6 Balancing the Training Data Set
7.7 Establishing Baseline Performance

CHAPTER 8 SIMPLE LINEAR REGRESSION
8.1 An Example of Simple Linear Regression
8.2 Dangers of Extrapolation
8.3 How Useful is the Regression? The Coefficient of Determination, r²
8.4 Standard Error of the Estimate, s
8.5 Correlation Coefficient r
8.6 ANOVA Table for Simple Linear Regression
8.7 Outliers, High Leverage Points, and Influential Observations
8.8 Population Regression Equation
8.9 Verifying the Regression Assumptions
8.10 Inference in Regression
8.11 t-Test for the Relationship Between x and y
8.12 Confidence Interval for the Slope of the Regression Line
8.13 Confidence Interval for the Correlation Coefficient ρ
8.14 Confidence Interval for the Mean Value of y Given x
8.15 Prediction Interval for a Randomly Chosen Value of y Given x
8.16 Transformations to Achieve Linearity
8.17 Box–Cox Transformations

CHAPTER 9 MULTIPLE REGRESSION AND MODEL BUILDING
9.1 An Example of Multiple Regression
9.2 The Population Multiple Regression Equation
9.3 Inference in Multiple Regression
9.4 Regression with Categorical Predictors, Using Indicator Variables
9.5 Adjusting R²: Penalizing Models for Including Predictors that are not Useful
9.6 Sequential Sums of Squares
9.7 Multicollinearity
9.8 Variable Selection Methods
9.9 Gas Mileage Data Set
9.10 An Application of Variable Selection Methods
9.11 Using the Principal Components as Predictors in Multiple Regression

PART III CLASSIFICATION

CHAPTER 10 k-NEAREST NEIGHBOR ALGORITHM
10.1 Classification Task
10.2 k-Nearest Neighbor Algorithm
10.3 Distance Function
10.4 Combination Function
10.5 Quantifying Attribute Relevance: Stretching the Axes
10.6 Database Considerations
10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction
10.8 Choosing k
10.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler

CHAPTER 11 DECISION TREES
11.1 What is a Decision Tree?
11.2 Requirements for Using Decision Trees
11.3 Classification and Regression Trees
11.4 C4.5 Algorithm
11.5 Decision Rules
11.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data

CHAPTER 12 NEURAL NETWORKS
12.1 Input and Output Encoding
12.2 Neural Networks for Estimation and Prediction
12.3 Simple Example of a Neural Network
12.4 Sigmoid Activation Function
12.5 Back-Propagation
12.6 Gradient-Descent Method
12.7 Back-Propagation Rules
12.8 Example of Back-Propagation
12.9 Termination Criteria
12.10 Learning Rate
12.11 Momentum Term
12.12 Sensitivity Analysis
12.13 Application of Neural Network Modeling

CHAPTER 13 LOGISTIC REGRESSION
13.1 Simple Example of Logistic Regression
13.2 Maximum Likelihood Estimation
13.3 Interpreting Logistic Regression Output
13.4 Inference: are the Predictors Significant?
13.5 Odds Ratio and Relative Risk
13.6 Interpreting Logistic Regression for a Dichotomous Predictor
13.7 Interpreting Logistic Regression for a Polychotomous Predictor
13.8 Interpreting Logistic Regression for a Continuous Predictor
13.9 Assumption of Linearity
13.10 Zero-Cell Problem
13.11 Multiple Logistic Regression
13.12 Introducing Higher Order Terms to Handle Nonlinearity
13.13 Validating the Logistic Regression Model
13.14 WEKA: Hands-On Analysis Using Logistic Regression

CHAPTER 14 NAÏVE BAYES AND BAYESIAN NETWORKS
14.1 Bayesian Approach
14.2 Maximum a Posteriori (MAP) Classification
14.3 Posterior Odds Ratio
14.4 Balancing the Data
14.5 Naïve Bayes Classification
14.6 Interpreting the Log Posterior Odds Ratio
14.7 Zero-Cell Problem
14.8 Numeric Predictors for Naïve Bayes Classification
14.9 WEKA: Hands-On Analysis Using Naïve Bayes
14.10 Bayesian Belief Networks
14.11 Clothing Purchase Example
14.12 Using the Bayesian Network to Find Probabilities

CHAPTER 15 MODEL EVALUATION TECHNIQUES
15.1 Model Evaluation Techniques for the Description Task
15.2 Model Evaluation Techniques for the Estimation and Prediction Tasks
15.3 Model Evaluation Measures for the Classification Task
15.4 Accuracy and Overall Error Rate
15.5 Sensitivity and Specificity
15.6 False-Positive Rate and False-Negative Rate
15.7 Proportions of True Positives, True Negatives, False Positives, and False Negatives
15.8 Misclassification Cost Adjustment to Reflect Real-World Concerns
15.9 Decision Cost/Benefit Analysis
15.10 Lift Charts and Gains Charts
15.11 Interweaving Model Evaluation with Model Building
15.12 Confluence of Results: Applying a Suite of Models

CHAPTER 16 COST-BENEFIT ANALYSIS USING DATA-DRIVEN COSTS
16.1 Decision Invariance Under Row Adjustment
16.2 Positive Classification Criterion
16.3 Demonstration of the Positive Classification Criterion
16.4 Constructing the Cost Matrix
16.5 Decision Invariance Under Scaling
16.6 Direct Costs and Opportunity Costs
16.7 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
16.8 Rebalancing as a Surrogate for Misclassification Costs

CHAPTER 17 COST-BENEFIT ANALYSIS FOR TRINARY AND k-NARY CLASSIFICATION MODELS
17.1 Classification Evaluation Measures for a Generic Trinary Target
17.2 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem
17.3 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem
17.4 Comparing CART Models with and without Data-Driven Misclassification Costs
17.5 Classification Evaluation Measures for a Generic k-Nary Target
17.6 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification

CHAPTER 18 GRAPHICAL EVALUATION OF CLASSIFICATION MODELS
18.1 Review of Lift Charts and Gains Charts
18.2 Lift Charts and Gains Charts Using Misclassification Costs
18.3 Response Charts
18.4 Profits Charts
18.5 Return on Investment (ROI) Charts

PART IV CLUSTERING

CHAPTER 19 HIERARCHICAL AND k-MEANS CLUSTERING
19.1 The Clustering Task
19.2 Hierarchical Clustering Methods
19.3 Single-Linkage Clustering
19.4 Complete-Linkage Clustering
19.5 k-Means Clustering
19.6 Example of k-Means Clustering at Work
19.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
19.8 Application of k-Means Clustering Using SAS Enterprise Miner
19.9 Using Cluster Membership to Predict Churn

CHAPTER 20 KOHONEN NETWORKS
20.1 Self-Organizing Maps
20.2 Kohonen Networks
20.3 Example of a Kohonen Network Study
20.4 Cluster Validity
20.5 Application of Clustering Using Kohonen Networks
20.6 Interpreting the Clusters
20.7 Using Cluster Membership as Input to Downstream Data Mining Models

CHAPTER 21 BIRCH CLUSTERING
21.1 Rationale for BIRCH Clustering
21.2 Cluster Features
21.3 Cluster Feature Tree
21.4 Phase 1: Building the CF Tree
21.5 Phase 2: Clustering the Sub-Clusters
21.6 Example of BIRCH Clustering, Phase 1: Building the CF Tree
21.7 Example of BIRCH Clustering, Phase 2: Clustering the Sub-Clusters
21.8 Evaluating the Candidate Cluster Solutions
21.9 Case Study: Applying BIRCH Clustering to the Bank Loans Data Set

CHAPTER 22 MEASURING CLUSTER GOODNESS
22.1 Rationale for Measuring Cluster Goodness
22.2 The Silhouette Method
22.3 Silhouette Example
22.4 Silhouette Analysis of the IRIS Data Set
22.5 The Pseudo-F Statistic
22.6 Example of the Pseudo-F Statistic
22.7 Pseudo-F Statistic Applied to the IRIS Data Set
22.8 Cluster Validation
22.9 Cluster Validation Applied to the Loans Data Set

PART V ASSOCIATION RULES

CHAPTER 23 ASSOCIATION RULES
23.1 Affinity Analysis and Market Basket Analysis
23.2 Support, Confidence, Frequent Itemsets, and the A Priori Property
23.3 How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
23.4 How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules
23.5 Extension from Flag Data to General Categorical Data
23.6 Information-Theoretic Approach: Generalized Rule Induction Method
23.7 Association Rules are Easy to do Badly
23.8 How can we Measure the Usefulness of Association Rules?
23.9 Do Association Rules Represent Supervised or Unsupervised Learning?
23.10 Local Patterns Versus Global Models

PART VI ENHANCING MODEL PERFORMANCE

CHAPTER 24 SEGMENTATION MODELS
24.1 The Segmentation Modeling Process
24.2 Segmentation Modeling Using EDA to Identify the Segments
24.3 Segmentation Modeling Using Clustering to Identify the Segments

CHAPTER 25 ENSEMBLE METHODS: BAGGING AND BOOSTING
25.1 Rationale for Using an Ensemble of Classification Models
25.2 Bias, Variance, and Noise
25.3 When to Apply, and not to Apply, Bagging
25.4 Bagging
25.5 Boosting
25.6 Application of Bagging and Boosting Using IBM/SPSS Modeler

CHAPTER 26 MODEL VOTING AND PROPENSITY AVERAGING
26.1 Simple Model Voting
26.2 Alternative Voting Methods
26.3 Model Voting Process
26.4 An Application of Model Voting
26.5 What is Propensity Averaging?
26.6 Propensity Averaging Process
26.7 An Application of Propensity Averaging

PART VII FURTHER TOPICS

CHAPTER 27 GENETIC ALGORITHMS
27.1 Introduction to Genetic Algorithms
27.2 Basic Framework of a Genetic Algorithm
27.3 Simple Example of a Genetic Algorithm at Work
27.4 Modifications and Enhancements: Selection
27.5 Modifications and Enhancements: Crossover
27.6 Genetic Algorithms for Real-Valued Variables
27.7 Using Genetic Algorithms to Train a Neural Network
27.8 WEKA: Hands-On Analysis Using Genetic Algorithms

CHAPTER 28 IMPUTATION OF MISSING DATA
28.1 Need for Imputation of Missing Data
28.2 Imputation of Missing Data: Continuous Variables
28.3 Standard Error of the Imputation
28.4 Imputation of Missing Data: Categorical Variables
28.5 Handling Patterns in Missingness

PART VIII CASE STUDY: PREDICTING RESPONSE TO DIRECT-MAIL MARKETING

CHAPTER 29 CASE STUDY, PART 1: BUSINESS UNDERSTANDING, DATA PREPARATION, AND EDA
29.1 Cross-Industry Standard Practice for Data Mining
29.2 Business Understanding Phase
29.3 Data Understanding Phase, Part 1: Getting a Feel for the Data Set
29.4 Data Preparation Phase
29.5 Data Understanding Phase, Part 2: Exploratory Data Analysis

CHAPTER 30 CASE STUDY, PART 2: CLUSTERING AND PRINCIPAL COMPONENTS ANALYSIS
30.1 Partitioning the Data
30.2 Developing the Principal Components
30.3 Validating the Principal Components
30.4 Profiling the Principal Components
30.5 Choosing the Optimal Number of Clusters Using BIRCH Clustering
30.6 Choosing the Optimal Number of Clusters Using k-Means Clustering
30.7 Application of k-Means Clustering
30.8 Validating the Clusters
30.9 Profiling the Clusters

CHAPTER 31 CASE STUDY, PART 3: MODELING AND EVALUATION FOR PERFORMANCE AND INTERPRETABILITY
31.1 Do you Prefer the Best Model Performance, or a Combination of Performance and Interpretability?
31.2 Modeling and Evaluation Overview
31.3 Cost-Benefit Analysis Using Data-Driven Costs
31.4 Variables to be Input to the Models
31.5 Establishing the Baseline Model Performance
31.6 Models that Use Misclassification Costs
31.7 Models that Need Rebalancing as a Surrogate for Misclassification Costs
31.8 Combining Models Using Voting and Propensity Averaging
31.9 Interpreting the Most Profitable Model

CHAPTER 32 CASE STUDY, PART 4: MODELING AND EVALUATION FOR HIGH PERFORMANCE ONLY
32.1 Variables to be Input to the Models
32.2 Models that Use Misclassification Costs
32.3 Models that Need Rebalancing as a Surrogate for Misclassification Costs
32.4 Combining Models Using Voting and Propensity Averaging
32.5 Lessons Learned
32.6 Conclusions

APPENDIX A DATA SUMMARIZATION AND VISUALIZATION
Part 1: Summarization 1: Building Blocks of Data Analysis
Part 2: Visualization: Graphs and Tables for Summarizing and Organizing Data
Part 3: Summarization 2: Measures of Center, Variability, and Position
Part 4: Summarization and Visualization of Bivariate Relationships

INDEX