Quotes worth referencing:
- p. 6 They show that statistically, the more data driven a firm is, the more productive it is.
- p. 9 What can I now do that I couldn’t do before, or do better than I could before?
- p. 13 Understanding data science is critical, because “unlike other technical projects, data science is supporting improved decision-making.”
- p. 29 Customer records and product identifiers are notoriously variable and noisy. Cleaning and matching customer records to ensure only one record per customer is itself a complicated analytics problem
- p. 341 Data incorporate the beliefs, purposes, biases, and pragmatics of those who designed the data collection systems
Types of applications: Two broad types:
- Decisions for which “discoveries” need to be made within data and
- decisions that repeat, especially at massive scale, so decision-making can benefit from even small increases in decision-making accuracy based on data analysis
- Marketing
- Customer retention
- Targeted marketing
- recommendation engines
- Sales
- Suggestions for what works (beyond intuition)
Chapter 1: Intro
- Data Science is the fundamental principles used to extract knowledge from data
- Use to discover unusual events, things outside the normal pattern of data
- Ultimate goal is improving decision making
- Structure:
- Data engineering and processing -> data science -> automated data-driven decisions -> data-driven decisions made across a firm
- Two broad types of data science
- Discovering insights
- Automating small improvements in a repeatable scalable way
- Big data is not just data science
- Just utilizing a large dataset isn’t data science, it’s extracting intelligence that matters
- Data and data scientists are complementary
- data worthless without a team that can use it
- data scientists worthless without data
- Think of data as a business asset, as in, it can be worth paying to acquire if you can get value out of it
- go buy the data you need
- spend time losing money to get to profits in the future
- Understanding data science is critical, because “unlike other technical projects, data science is supporting improved decision-making.”
- Extracting useful, non-trivial, and hopefully actionable insights from large datasets
- Fundamental concepts:
- “Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages.”
- structured thinking about analytics emphasizes careful analysis of a problem
- “From a large mass of data, information technology can be used to find informative descriptive attributes of entities of interest”
- “If you look too hard at a set of data, you will find something — but it might not generalize beyond the data you’re looking at.” overfitting
- “Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used.”
Chapter 2: Business Problems and Data Science Solutions
- A process with well understood stages
- closer to systemic analyses rather than heroic endeavors driven by chance and individual acumen
- Decompose a business problem into a set of subtasks, then figure out how to do those subtasks
- example: churn
- what in our historical data makes it likely someone will terminate their contract with us after it is up
- once you identify those things, either predict what causes those things or develop business solutions
- key skill is breaking problems down into subtasks that have known solutions
- Various data mining tasks
- classification and class probability estimation
- predict, for each individual in a population, which of a small set of classes this individual belongs to
- classes are mutually exclusive
- Regression (“value estimation”)
- attempts to predict the numerical value of some variable for the individual
- service usage
- money spent
- classification asks whether something will happen; regression asks how much of it will happen
- Similarity matching
- attempts to identify similar individuals based on data known about them
- customers similar to you also bought x
- Clustering
- Group individuals in a population together by their similarity
- what natural groups exist
- groups things together based on their own attributes
- co-occurrence grouping
- (aka frequent itemset mining, association rule discovery, or market-basket analysis)
- attempts to find associations between entities based on transactions involving them
- groups things together because they occur in the same event together
- i.e. basket
- transaction
- Profiling (behavior description)
- attempts to characterize the typical behavior of an individual, group, or population
- e.g. “what is the typical usage of this customer segment?”
- often used to establish norms for anomaly detection applications
- link prediction
- attempts to predict connections between data items
- suggests that a link should exist with some probability
- e.g. since you and Karen share 10 friends, perhaps you should be Karen’s friend
- data reduction
- taking a large dataset and replacing it with a smaller one of just key attributes
- trade-offs need to be assessed
- causal modeling
- attempts to help understand what events or actions actually influence others
- tries to analyze what would happen when the thing does or doesn’t happen
- e.g. compare what happens when someone did or didn’t see an ad
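To make one task from the list above concrete, here is a minimal sketch of co-occurrence grouping: counting how often pairs of items show up in the same basket. The function name `pair_counts` and the sample baskets are invented for illustration and do not come from the book; real market-basket analysis would normally add measures such as support or lift on top of raw counts.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops repeated items and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical transactions, made up for this sketch
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["beer", "chips"],
    ["bread", "butter"],
]

counts = pair_counts(baskets)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are candidates for "these things occur together" associations, which is the grouping-by-shared-event idea rather than grouping by the items' own attributes.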
- Supervised versus unsupervised methods
- unsupervised: haven’t defined the item of interest
- do our customers fall into different groups
- supervised: have defined the item of interest
- can we find groups likely to do x
- There’s a difference between mining the data to find patterns and build models and using the results of data mining.
- The data mining process
- Business understanding
- may need to reformulate several times to truly understand the problem
- critical to successfully cast the business problem as one or more data science problems
- think carefully about the problem to be solved and the use scenario
- what exactly do we want to do
- how do we want to do it
- which parts represent possible data mining models
- Data Understanding
- Data is the raw material from which the solution will be built
- Understand the strengths and limitations of the data
- Historical data are often collected for unrelated purposes or no explicit purpose
- customer database
- transaction database
- marketing response database
- What’s the cost of acquiring the data
- free?
- effort?
- purchased?
- costs and benefits of data sources, is further effort worthwhile
- Data Preparation
- Data often needs to be in a different format from how it “appears naturally”
- converted to tabular format
- remove or infer missing data
- convert to different types
- e.g. category to binary
- Make sure every input will actually be available at the point of decision making; relying on information that will not be is called “leakage”
- e.g. a historical dataset can predict x from y, but will you know y at the point in time when the prediction is useful?
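The preparation steps above (tabular format, inferring missing values, converting categories to binary) can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `one_hot` and `fill_missing_with_mean`, the column names, and the sample rows are all hypothetical, and filling gaps with the column mean is just one simple way to infer missing data.

```python
def fill_missing_with_mean(rows, column):
    """Replace None in a numeric column with the mean of the known values."""
    known = [row[column] for row in rows if row[column] is not None]
    mean = sum(known) / len(known)
    return [
        {**row, column: mean if row[column] is None else row[column]}
        for row in rows
    ]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being converted
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}={cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Hypothetical customer records, made up for this sketch
customers = [
    {"plan": "basic", "monthly_minutes": 120.0},
    {"plan": "premium", "monthly_minutes": None},
    {"plan": "basic", "monthly_minutes": 80.0},
]

prepared = one_hot(fill_missing_with_mean(customers, "monthly_minutes"), "plan")
for row in prepared:
    print(row)
```

Neither step guards against leakage; that still requires checking, column by column, that each value would be known at the moment the decision is made.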
- Modeling
- The output of modeling is a model or pattern capturing regularities in the data
- Evaluation
- Assess results rigorously and gain confidence that they are valid and reliable
- Ensure the model satisfies the original business goals
- Once deployed, is the data useful? Too many false alarms?
- Will the model do more good than harm?
- Will the model avoid catastrophic failures?
- Will it be comprehensible?
- Deployment
- Putting the results into real use
- e.g. send offers to customers who are predicted to cancel contracts
- Can deploy the model or the data mining technique itself
- Model:
- Create some new programmatic event based on the model
- Create a new marketing or business strategy based on the model
- Data mining itself deployed:
- Data science team creates working model
- It is coded by production development team into production
- Iterate
- The process itself creates a lot of insights, and can then be used to take things further
- Implications for managing the data science team
- Data mining is closer to R&D than it is to engineering, so the typical software development cycle is not as applicable
- CRISP cycle is based on exploration, approaches and strategy
- outcomes less certain
- learnings at any one step may need to change process
- Software skills vs. analytics skills
- Software: write efficient, high quality code from requirements
- Analytics: formulate problems well, prototype solutions quickly, make reasonable assumptions in the face of ill-structured problems, design experiments that represent good investments, and analyze results
- Other analytics techniques
- Statistics
- measuring and comparing the distribution of attributes within a population
- Assessing statistical probabilities of certain things
- Testing hypotheses (data mining is as much about generating hypotheses to test)
- Database querying
- Translating an idea or question into a machine-readable query that returns a desired set of information, like the most expensive houses in New York owned by mothers under 30
- Data warehousing
- collect and coalesce data from across an enterprise, often from multiple different systems
- data collected in one warehouse can be easier to use than data in silos throughout the business
- Regression analysis
- Looking at historical data to assess how given customers would act according to that historical data
- (vs. data mining regression, which is trying to predict what a new customer would do given past information and how generalizable it may or may not be)
- Machine learning
- Analyzing data from the environment and making predictions about unknown quantities
- subfield of AI
- Answering business questions with data science
- Who are the most profitable customers?
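Once "profitable" has a precise definition, a question like the one above is closer to database querying than to data mining. As a rough sketch, assuming a hypothetical `transactions` table with `customer`, `revenue`, and `cost` columns (none of which come from the book), Python's built-in `sqlite3` module can answer it directly:

```python
import sqlite3

# In-memory database with a hypothetical transactions table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, revenue REAL, cost REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("ann", 120.0, 40.0),
        ("bob", 300.0, 290.0),
        ("ann", 60.0, 20.0),
        ("cy", 90.0, 30.0),
    ],
)

# Profit here is assumed to mean total revenue minus total cost per customer
top = conn.execute(
    "SELECT customer, SUM(revenue - cost) AS profit "
    "FROM transactions GROUP BY customer ORDER BY profit DESC LIMIT 2"
).fetchall()
conn.close()

print(top)
```

The query only summarizes customers already in the data. Asking which new customers are likely to become profitable is a predictive question, and that is where the data mining tasks above come in.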
Chapter 3 Predictive Modeling
Chapter 4 Fitting a Model to Data
Chapter 5 Overfitting and Its Avoidance
Chapter 6 Similarity, Neighbors and Clusters
Chapter 7 Decision Analytic Thinking I: What Is A Good Model
Chapter 8 Visualizing Model Performance
Chapter 9 Evidence and Probabilities
Chapter 10 Representing and Mining Text
Chapter 11 Decision Analytic Thinking II: Toward Analytical Engineering
Chapter 12 Other Data Science Tasks and Techniques
Chapter 13 Data Science and Business Strategy
- Management must
- Think data-analytically
- Create a culture where data science, and data scientists, will thrive
- Managers should be able to
- Manage a data science team
- Ask probing questions of a data scientist
- The better you get at certain data science problems the more opportunities you see to apply them (see Amazon’s AI flywheel)
- Achieving Competitive Advantage with Data Science
- Data and data science capability are complementary
- How can data and data science provide value in the context of our business strategy
- The data (asset) must be valuable in the context of our strategy
- The data must be unique to us or we must be uniquely capable of taking advantage of it
- Is the advantage sustainable?
- Can competitors easily duplicate our data or data capabilities?
- Always be investing in new data assets and always be pursuing new capabilities
- Have a dataset or capability that is hard or expensive to replicate
- Historical advantage
- Have more data over a broader period of time
- What you can do with the data is valuable and increases customer switching costs
- Have unique intellectual property
- Unique intangible collateral assets
- Your model is not what your data scientists design, it is what your engineers implement
- Superior data scientists
- You need at least one top-notch data scientist to assess prospective hires
- Superior data science management
- understand, appreciate and anticipate the needs of the business
- communicate well with and be respected by both suits and techies
- coordinate technically complex activities
- anticipate outcomes of data science projects
- Build a great data science team — one of the best perks is working alongside other great people
- Consider PhD candidates
- Examine data science case studies
- Get ideas that are applicable to you by seeing what others have done
- Get in the habit of reading how business problems get translated into data science tasks
- Be ready to accept creative ideas from any source
- Be ready to evaluate proposals for data science projects
- Is the business problem well specified?
- Does the data science solution solve the problem?
- Is it clear how we would evaluate a solution?
- Would we be able to see evidence of success before making a huge investment in deployment?
- Does the firm have the data assets it needs? Is there labelled training data? Is the firm ready to invest in assets it does not have yet?
- Proposal in Appendix A
- How mature is your firm's data science?
- low: completely ad hoc with little experience or training
- medium: well-trained scientists, managers and stakeholders that understand fundamental principles of data science
- high: work to continually improve data science processes (and not just the solution). managers work to ensure scientists are developing processes along the lines of the business strategy
Chapter 14 Conclusion
- Overall concepts
- how data science fits into the firm and competitive landscape
- How to think data analytically
- how to extract knowledge from data
- Use expected values paired with probabilities to assess costs and benefits
- Always note when the definition of a problem changes to fit the data — make sure it is understood and communicated
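The expected value idea can be written out as a small calculation: weight each possible outcome's value by its probability and add them up. The sketch below assumes a made-up targeted offer (a 5% response rate, a net gain of 99 if the customer responds, a loss of 1 if not); the numbers and the `expected_value` helper are illustrative only.

```python
def expected_value(outcomes):
    """Sum of probability * value over mutually exclusive, exhaustive outcomes."""
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * v for p, v in outcomes)


# Hypothetical numbers, made up for this sketch
p_respond = 0.05        # estimated probability the customer responds
value_response = 99.0   # profit from a response, net of the cost of the offer
value_no_response = -1.0  # cost of sending an offer that is ignored

ev_target = expected_value([
    (p_respond, value_response),
    (1 - p_respond, value_no_response),
])

# Target the customer only if the expected value beats doing nothing (value 0)
should_target = ev_target > 0
print(ev_target, should_target)
```

In practice the probability would come from a model's class probability estimate and the values from the business context, which is why the two are worth keeping separate.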
Appendix A