Machine Learning with Spark and Python
Essential Techniques for Predictive Analytics
Paperback, English, 2019
659 kr
Special order. Ships within 3-6 business days.
Free shipping for members on orders of at least 249 kr.

Machine Learning with Spark and Python: Essential Techniques for Predictive Analytics, Second Edition, simplifies machine learning for practical use by focusing on two key algorithm families. This second edition adds Spark, an ML framework from the Apache Software Foundation. With Spark, machine learning students can process much larger data sets and call the Spark algorithms from ordinary Python code. The book concentrates on two algorithm families, linear methods and ensemble methods, that effectively predict outcomes. This type of problem covers many use cases, such as choosing which ad to place on a web page, predicting prices in securities markets, or detecting credit card fraud. Limiting the focus to two families leaves room for full descriptions of the mechanisms at work in each algorithm, and the code examples illustrate that machinery with specific, hackable code.
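To give a flavor of the penalized linear methods the book covers, here is a minimal, pure-Python sketch of ridge regression for a single feature. This is an illustration only, not code from the book (the book's examples use scikit-learn and PySpark); the toy data and penalty values below are made up for the demonstration.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of y ~ w*x (no intercept).

    Minimizing sum((y - w*x)^2) + lam*w^2 gives
    w = sum(x*y) / (sum(x^2) + lam); lam=0 recovers ordinary least squares.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Toy data, roughly y = 2*x (values chosen for illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.0, 8.1]

w_ols = ridge_1d(xs, ys, lam=0.0)   # ordinary least-squares slope
w_pen = ridge_1d(xs, ys, lam=10.0)  # penalty shrinks the slope toward zero
```

The penalty term is what "regulates" the fit: larger `lam` pulls coefficients toward zero, trading a little bias for lower variance, which is the core idea behind the ridge, lasso, and ElasticNet methods developed in Chapters 3-5.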
Product information
- Publication date: 2019-12-05
- Dimensions: 185 x 229 x 23 mm
- Weight: 612 g
- Format: Paperback
- Language: English
- Number of pages: 368
- Edition: 2
- Publisher: John Wiley & Sons Inc
- ISBN: 9781119561934
MICHAEL BOWLES teaches machine learning at UC Berkeley, the University of New Haven, and Hacker Dojo in Silicon Valley; consults on machine learning projects; and is involved in a number of startups in areas such as semiconductor inspection, drug design and optimization, and trading in the financial markets. Following an assistant professorship at MIT, Michael went on to found and run two Silicon Valley startups, both of which went public. His courses are consistently popular and receive great feedback from participants.
Table of Contents

Introduction

Chapter 1: The Two Essential Algorithms for Making Predictions
- Why Are These Two Algorithms So Useful?
- What Are Penalized Regression Methods?
- What Are Ensemble Methods?
- How to Decide Which Algorithm to Use
- The Process Steps for Building a Predictive Model
- Framing a Machine Learning Problem
- Feature Extraction and Feature Engineering
- Determining Performance of a Trained Model
- Chapter Contents and Dependencies
- Summary

Chapter 2: Understand the Problem by Understanding the Data
- The Anatomy of a New Problem
- Different Types of Attributes and Labels Drive Modeling Choices
- Things to Notice about Your New Data Set
- Classification Problems: Detecting Unexploded Mines Using Sonar
- Physical Characteristics of the Rocks Versus Mines Data Set
- Statistical Summaries of the Rocks Versus Mines Data Set
- Visualization of Outliers Using a Quantile-Quantile Plot
- Statistical Characterization of Categorical Attributes
- How to Use Python Pandas to Summarize the Rocks Versus Mines Data Set
- Visualizing Properties of the Rocks Versus Mines Data Set
- Visualizing with Parallel Coordinates Plots
- Visualizing Interrelationships between Attributes and Labels
- Visualizing Attribute and Label Correlations Using a Heat Map
- Summarizing the Process for Understanding the Rocks Versus Mines Data Set
- Real-Valued Predictions with Factor Variables: How Old Is Your Abalone?
- Parallel Coordinates for Regression Problems: Visualize Variable Relationships for the Abalone Problem
- How to Use a Correlation Heat Map for Regression: Visualize Pair-Wise Correlations for the Abalone Problem
- Real-Valued Predictions Using Real-Valued Attributes: Calculate How Your Wine Tastes
- Multiclass Classification Problem: What Type of Glass Is That?
- Using PySpark to Understand Large Data Sets
- Summary

Chapter 3: Predictive Model Building: Balancing Performance, Complexity, and Big Data
- The Basic Problem: Understanding Function Approximation
- Working with Training Data
- Assessing Performance of Predictive Models
- Factors Driving Algorithm Choices and Performance: Complexity and Data
- Contrast between a Simple Problem and a Complex Problem
- Contrast between a Simple Model and a Complex Model
- Factors Driving Predictive Algorithm Performance
- Choosing an Algorithm: Linear or Nonlinear?
- Measuring the Performance of Predictive Models
- Performance Measures for Different Types of Problems
- Simulating Performance of Deployed Models
- Achieving Harmony between Model and Data
- Choosing a Model to Balance Problem Complexity, Model Complexity, and Data Set Size
- Using Forward Stepwise Regression to Control Overfitting
- Evaluating and Understanding Your Predictive Model
- Control Overfitting by Penalizing Regression Coefficients: Ridge Regression
- Using PySpark for Training Penalized Regression Models on Extremely Large Data Sets
- Summary

Chapter 4: Penalized Linear Regression
- Why Penalized Linear Regression Methods Are So Useful
- Extremely Fast Coefficient Estimation
- Variable Importance Information
- Extremely Fast Evaluation When Deployed
- Reliable Performance
- Sparse Solutions
- Problem May Require Linear Model
- When to Use Ensemble Methods
- Penalized Linear Regression: Regulating Linear Regression for Optimum Performance
- Training Linear Models: Minimizing Errors and More
- Adding a Coefficient Penalty to the OLS Formulation
- Other Useful Coefficient Penalties: Manhattan and ElasticNet
- Why Lasso Penalty Leads to Sparse Coefficient Vectors
- ElasticNet Penalty Includes Both Lasso and Ridge
- Solving the Penalized Linear Regression Problem
- Understanding Least Angle Regression and Its Relationship to Forward Stepwise Regression
- How LARS Generates Hundreds of Models of Varying Complexity
- Choosing the Best Model from the Hundreds LARS Generates
- Using Glmnet: Very Fast and Very General
- Comparison of the Mechanics of Glmnet and LARS Algorithms
- Initializing and Iterating the Glmnet Algorithm
- Extension of Linear Regression to Classification Problems
- Solving Classification Problems with Penalized Regression
- Working with Classification Problems Having More Than Two Outcomes
- Understanding Basis Expansion: Using Linear Methods on Nonlinear Problems
- Incorporating Non-Numeric Attributes into Linear Methods
- Summary

Chapter 5: Building Predictive Models Using Penalized Linear Methods
- Python Packages for Penalized Linear Regression
- Multivariable Regression: Predicting Wine Taste
- Building and Testing a Model to Predict Wine Taste
- Training on the Whole Data Set before Deployment
- Basis Expansion: Improving Performance by Creating New Variables from Old Ones
- Binary Classification: Using Penalized Linear Regression to Detect Unexploded Mines
- Build a Rocks Versus Mines Classifier for Deployment
- Multiclass Classification: Classifying Crime Scene Glass Samples
- Linear Regression and Classification Using PySpark
- Using PySpark to Predict Wine Taste
- Logistic Regression with PySpark: Rocks Versus Mines
- Incorporating Categorical Variables in a PySpark Model: Predicting Abalone Rings
- Multiclass Logistic Regression with Meta Parameter Optimization
- Summary

Chapter 6: Ensemble Methods
- Binary Decision Trees
- How a Binary Decision Tree Generates Predictions
- How to Train a Binary Decision Tree
- Tree Training Equals Split Point Selection
- How Split Point Selection Affects Predictions
- Algorithm for Selecting Split Points
- Multivariable Tree Training: Which Attribute to Split?
- Recursive Splitting for More Tree Depth
- Overfitting Binary Trees
- Measuring Overfit with Binary Trees
- Balancing Binary Tree Complexity for Best Performance
- Modifications for Classification and Categorical Features
- Bootstrap Aggregation: "Bagging"
- How Does the Bagging Algorithm Work?
- Bagging Performance: Bias Versus Variance
- How Bagging Behaves on Multivariable Problems
- Bagging Needs Tree Depth for Performance
- Summary of Bagging
- Gradient Boosting
- Basic Principle of Gradient Boosting Algorithm
- Parameter Settings for Gradient Boosting
- How Gradient Boosting Iterates toward a Predictive Model
- Getting the Best Performance from Gradient Boosting
- Gradient Boosting on a Multivariable Problem
- Summary for Gradient Boosting
- Random Forests
- Random Forests: Bagging Plus Random Attribute Subsets
- Random Forests Performance Drivers
- Random Forests Summary
- Summary

Chapter 7: Building Ensemble Models with Python
- Solving Regression Problems with Python Ensemble Packages
- Using Gradient Boosting to Predict Wine Taste
- Using the Class Constructor for GradientBoostingRegressor
- Using GradientBoostingRegressor to Implement a Regression Model
- Assessing the Performance of a Gradient Boosting Model
- Building a Random Forest Model to Predict Wine Taste
- Constructing a RandomForestRegressor Object
- Modeling Wine Taste with RandomForestRegressor
- Visualizing the Performance of a Random Forest Regression Model
- Incorporating Non-Numeric Attributes in Python Ensemble Models
- Coding the Sex of Abalone for Gradient Boosting Regression in Python
- Assessing Performance and the Importance of Coded Variables with Gradient Boosting
- Coding the Sex of Abalone for Input to Random Forest Regression in Python
- Assessing Performance and the Importance of Coded Variables
- Solving Binary Classification Problems with Python Ensemble Methods
- Detecting Unexploded Mines with Python Gradient Boosting
- Determining the Performance of a Gradient Boosting Classifier
- Detecting Unexploded Mines with Python Random Forest
- Constructing a Random Forest Model to Detect Unexploded Mines
- Determining the Performance of a Random Forest Classifier
- Solving Multiclass Classification Problems with Python Ensemble Methods
- Dealing with Class Imbalances
- Classifying Glass Using Gradient Boosting
- Determining the Performance of the Gradient Boosting Model on Glass Classification
- Classifying Glass with Random Forests
- Determining the Performance of the Random Forest Model on Glass Classification
- Solving Regression Problems with PySpark Ensemble Packages
- Predicting Wine Taste with PySpark Ensemble Methods
- Predicting Abalone Age with PySpark Ensemble Methods
- Distinguishing Mines from Rocks with PySpark Ensemble Methods
- Identifying Glass Types with PySpark Ensemble Methods
- Summary

Index