Data Science For Business

Quotes worth referencing:

  1. p. 6 They show that statistically, the more data-driven a firm is, the more productive it is.
  2. p. 9 What can I now do that I couldn’t do before, or do better than I could before?
  3. p. 13 Understanding data science is critical, because “unlike other technical projects, data science is supporting improved decision-making.”
  4. p. 29 Customer records and product identifiers are notoriously variable and noisy. Cleaning and matching customer records to ensure only one record per customer is itself a complicated analytics problem.
  5. p. 341 Data incorporate the beliefs, purposes, biases, and pragmatics of those who designed the data collection systems.

Types of applications: two broad types:

  1. Decisions for which “discoveries” need to be made within data
  2. Decisions that repeat, especially at massive scale, so decision-making can benefit from even small increases in accuracy based on data analysis
  3. Marketing
    1. Customer retention
    2. Targeted marketing
    3. recommendation engines
  4. Sales
    1. Suggestions for what works (beyond intuition)

Chapter 1: Intro

  1. Data Science is the fundamental principles used to extract knowledge from data
  2. Use to discover unusual events, things outside the normal pattern of data
  3. Ultimate goal is improving decision making
  4. Structure:
    1. Data engineering and processing -> data science -> automated data-driven decisions -> data-driven decisions made across a firm
  5. Two broad types of data science
    1. Discovering insights
    2. Automating small improvements in a repeatable scalable way
  6. Big data is not just data science
    1. Just utilizing a large dataset isn’t data science, it’s extracting intelligence that matters
  7. Data and data scientists are complementary
    1. data worthless without a team that can use it
    2. data scientists worthless without data
  8. Think of data as a business asset, as in, it can be worth paying to acquire if you can get value out of it
    1. go buy the data you need
    2. spend time losing money to get to profits in the future
  9. Understanding data science is critical, because “unlike other technical projects, data science is supporting improved decision-making.”
  10. Extracting useful, non-trivial, and hopefully actionable insights from large datasets
  11. Fundamental concepts:
    1. “Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages.”
      1. structured thinking about analytics emphasizes careful analysis of a problem
    2. “From a large mass of data, information technology can be used to find informative descriptive attributes of entities of interest”
    3. “If you look too hard at a set of data, you will find something — but it might not generalize beyond the data you’re looking at.” overfitting
    4. “Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used.”

Chapter 2: Business Problems and Data Science Solutions

  1. A process with well understood stages
  2. closer to systemic analyses rather than heroic endeavors driven by chance and individual acumen
  3. Decompose a business problem into a set of subtasks, then figure out how to do those subtasks
  4. example: churn
    1. what in our historical data makes it likely someone will terminate their contract with us after it is up
    2. once you identify those things, either predict what causes those things or develop business solutions
  5. key skill is breaking problems down into subtasks that have known solutions
  6. Various data mining tasks
    1. classification and class probability estimation
      1. predict, for each individual in a population, which of a small set of classes this individual belongs to
      2. classes are mutually exclusive
    2. Regression (“value estimation”)
      1. attempts to predict the numerical value of some variable for the individual
        1. service usage
        2. money spent
      2. classification asks “will this happen?” and regression asks “how much will it happen?”
    3. Similarity matching
      1. attempts to identify similar individuals based on data known about them
      2. customers similar to you also bought x
    4. Clustering
      1. Group individuals in a population together by their similarity
      2. what natural groups exist
      3. groups things together based on their own attributes
    5. co-occurrence grouping
      1. aka frequent itemset mining, association rule discovery, or market-basket analysis
      2. attempts to find associations between entities based on transactions involving them
      3. groups things together because they occur in the same event together
        1. i.e. basket
        2. transaction
    6. Profiling (behavior description)
      1. attempts to characterize the typical behavior of an individual, group, or population
      2. e.g. what is the typical usage of this customer segment?
      3. often used to establish norms for anomaly detection applications
    7. link prediction
      1. attempts to predict connections between data items
      2. suggests that a link should exist with some probability
      3. e.g. since you and Karen share 10 friends, perhaps you should be Karen’s friend
    8. data reduction
      1. taking a large dataset and replacing it with a smaller one of just key attributes
      2. trade-offs need to be assessed
    9. causal modeling
      1. attempts to help understand what events or actions actually influence others
      2. tries to analyze what would happen when the thing does or doesn’t happen
        1. e.g. did or didn’t see an ad
  7. Supervised versus unsupervised methods
    1. unsupervised: haven’t defined the item of interest
      1. do our customers fall into different groups
    2. supervised: have defined the item of interest
      1. can we find groups likely to do x
  8. There’s a difference between mining the data to find patterns and build models and using the results of data mining.
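
To make one of the tasks above concrete, here is a minimal sketch of co-occurrence grouping (market-basket analysis). The baskets, the `pair_support` helper, and the 0.5 threshold are all made up for illustration and are not from the book; real association-rule mining also handles larger itemsets and measures beyond raw support.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each set is one transaction (illustrative data only).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def pair_support(transactions):
    """Return the fraction of transactions in which each item pair co-occurs."""
    counts = Counter()
    for items in transactions:
        # sorted() gives each pair one canonical order, e.g. ("bread", "butter")
        counts.update(combinations(sorted(items), 2))
    total = len(transactions)
    return {pair: n / total for pair, n in counts.items()}

support = pair_support(baskets)
# Keep pairs that appear together in at least half of the baskets
# (the 0.5 cutoff is an arbitrary assumption).
frequent = {pair: s for pair, s in support.items() if s >= 0.5}
print(frequent)
```

Note that items are grouped only because they show up in the same transaction, not because of any attribute of the items themselves, which is the contrast with clustering drawn in the list above.
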
  1. The data mining process
  1. Business understanding
    1. may need to reformulate several times to truly understand the problem
    2. critical to successfully cast the business problem as one or more data science problems
    3. think carefully about the problem to be solved and the use scenario
      1. what exactly do we want to do
      2. how do we want to do it
      3. which parts represent possible data mining models
  2. Data Understanding
    1. Data is the raw material from which the solution will be built
    2. Understand the strengths and limitations of the data
    3. Historical data are often collected for unrelated purposes or no explicit purpose
      1. customer database
      2. transaction database
      3. marketing response database
    4. What’s the cost of acquiring the data
      1. free?
      2. effort?
      3. purchased?
      4. costs and benefits of data sources, is further effort worthwhile
  3. Data Preparation
    1. Data often needs to be in a different format than it “appears naturally”
      1. converted to tabular format
      2. remove or infer missing data
      3. convert to different types
        1. i.e. category to binary
    2. Must be sure data will be available at the point of decision making – “leakage”
      1. i.e. historical dataset can predict x from y, but will you know y at the point in time it is useful to know
  4. Modeling
    1. The output is a model or pattern capturing regularities in the data
  5. Evaluation
    1. Assess results rigorously and gain confidence that they are valid and reliable
    2. Ensure the model satisfies the original business goals
    3. Once deployed, is the data useful? Too many false alarms?
    4. Will the model do more good than harm?
    5. Will the model avoid catastrophic failures?
    6. Will it be comprehensible?
  6. Deployment
    1. Putting the results into real use
      1. e.g. send offers to customers who are predicted to cancel contracts
    2. Can deploy the model or the data mining technique itself
    3. Model:
      1. Create some new programmatic event based on the model
      2. Create a new marketing or business strategy based on the model
    4. Data mining itself deployed:
      1. Data science team creates working model
      2. It is coded by production development team into production
  7. Iterate
    1. The process itself creates a lot of insights, and can then be used to take things further
  8. Implications for managing the data science team
    1. Data mining is closer to R&D than it is to engineering, so the typical software development cycle is not as applicable
      1. CRISP cycle is based on exploration, approaches and strategy
      2. outcomes less certain
      3. learnings at any one step may need to change process
    2. Software skills vs. analytics skills
      1. Software: write efficient, high quality code from requirements
      2. Analytics: formulate problems well, prototype solutions quickly, make reasonable assumptions in the face of ill-structured problems, design experiments that represent good investments, and analyze results
  9. Other analytics techniques
    1. Statistics
      1. measuring and comparing the distribution of attributes within a population
      2. Assessing statistical probabilities of certain things
      3. Testing hypotheses (data mining is as much about generating hypotheses to test)
    2. Database querying
      1. Translating an idea or question into a machine-readable format to output a desired set of information, like the most expensive houses in New York owned by mothers under 30
    3. Data warehousing
      1. collect and coalesce data from across an enterprise, often from multiple different systems
      2. data collected in one warehouse can be easier to use than data in siloes throughout the business
    4. Regression analysis
      1. Looking at historical data to assess how given customers would act according to that historical data
      2. (vs. data mining regression, which is trying to predict what a new customer would do given past information and how generalizable it may or may not be)
    5. Machine learning
      1. Analyzing data from the environment and making predictions about unknown quantities
      2. subfield of AI
  10. Answering business questions with data science
    1. Who are the most profitable customers?
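
For contrast with data mining, the database-querying example in the list above (most expensive houses in New York owned by mothers under 30) can be sketched with Python's built-in `sqlite3`. The `houses` table, its columns, and every row are invented for illustration. The point is that the analyst already knows the question, so the query only filters and sorts; it does not discover patterns.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE houses (owner TEXT, owner_age INTEGER, is_mother INTEGER, "
    "city TEXT, price INTEGER)"
)
conn.executemany(
    "INSERT INTO houses VALUES (?, ?, ?, ?, ?)",
    [
        ("Ana", 28, 1, "New York", 950000),
        ("Bea", 34, 1, "New York", 1200000),
        ("Cal", 27, 0, "New York", 1500000),
        ("Dee", 29, 1, "New York", 1100000),
        ("Eve", 25, 1, "Boston", 2000000),
    ],
)

# The question becomes a filter (WHERE) plus a sort (ORDER BY).
rows = conn.execute(
    "SELECT owner, price FROM houses "
    "WHERE city = ? AND is_mother = 1 AND owner_age < 30 "
    "ORDER BY price DESC",
    ("New York",),
).fetchall()
print(rows)
conn.close()
```
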

Chapter 3 Predictive Modeling

Chapter 4 Fitting a Model to Data
Chapter 5 Overfitting and Its Avoidance
Chapter 6 Similarity, Neighbors and Clusters
Chapter 7 Decision Analytic Thinking I: What Is A Good Model
Chapter 8 Visualizing Model Performance
Chapter 9 Evidence and Probabilities
Chapter 10 Representing and Mining Text
Chapter 11 Decision Analytic Thinking II: Toward Analytical Engineering
Chapter 12 Other Data Science Tasks and Techniques
Chapter 13 Data Science and Business Strategy

  1. Management must
    1. Think data-analytically
    2. Create a culture where data science, and data scientists, will thrive
  2. Managers should be able to
    1. Manage a data science team
    2. Ask probing questions of a data scientist
  3. The better you get at certain data science problems the more opportunities you see to apply them (see Amazon’s AI flywheel)
  4. Achieving Competitive Advantage with Data Science
    1. Data and data science capability are complementary
    2. How can data and data science provide value in the context of our business strategy
    3. The data (asset) must be valuable in the context of our strategy
    4. The data must be unique to us or we must be uniquely capable of taking advantage of it
  5. Is the advantage sustainable?
    1. Can competitors easily duplicate our data or data capabilities?
    2. Always be investing in new data assets and always be pursuing new capabilities
    3. Have a dataset or capability that is hard or expensive to replicate
    4. Historical advantage
      1. Have more data over a broader period of time
      2. What you can do with the data is valuable and increases customer switching costs
    5. Have unique intellectual property
    6. Unique intangible collateral assets
      1. Your model is not what your data scientists design, it is what your engineers implement
    7. Superior data scientists
      1. You need at least one top-notch data scientist to assess prospective hires
    8. Superior data science management
      1. understand, appreciate and anticipate the needs of the business
      2. communicate well with and be respected by both suits and techies
      3. coordinate technically complex activities
      4. anticipate outcomes of data science projects
    9. Build a great data science team — one of the best perks is working alongside other great people
      1. Consider PhD candidates
  6. Examine data science case studies
    1. Get ideas that are applicable to you by seeing what others have done
    2. Get in the habit of reading how business problems get translated into data science tasks
  7. Be ready to accept creative ideas from any source
  8. Be ready to evaluate proposals for data science projects
    1. Is the business problem well specified?
    2. Does the data science solution solve the problem?
    3. Is it clear how we would evaluate a solution?
    4. Would we be able to see evidence of success before making a huge investment in deployment?
    5. Does the firm have the data assets it needs? Is there labelled training data? Is the firm ready to invest in assets it does not have yet?
  9. Proposal in Appendix A
  10. How mature is your firm’s data science?
    1. low: completely ad hoc with little experience or training
    2. medium: well-trained scientists, managers and stakeholders that understand fundamental principles of data science
    3. high: work to continually improve data science processes (and not just the solution). managers work to ensure scientists are developing processes along the lines of the business strategy

Chapter 14 Conclusion

  1. Overall concepts
    1. how data science fits into the firm and competitive landscape
    2. How to think data analytically
    3. how to extract knowledge from data
  2. Use expected values paired with probabilities to assess costs and benefits
  3. Always note when the definition of a problem changes to fit the data — make sure it is understood and communicated

Appendix A