Craig's Notes

Saudi Arabia:

  • Worked at public health in Saudi Arabia
  • Population health management
  • Wanted to monitor health conditions
  • Started on MATLAB
  • Developed tools to collect data
  • Created reports with data
  • Created $25 million in value with data
  • Data helped roll out better health programs
  • Now:
    • 1000 People leading teams to analyze data

Case:

  • Started as Neuroscience undergrad
  • Took classes that aligned with data science
  • Applied Math, Applied Stats, Programming
  • Talked to chair of department to create own data science major

Current Projects:

  • Local company that ships 10 million+ unique products
    • Resolve searching and indexing on website
    • Improve search results
    • Find similar matches ( 1 or 2 letter difference )
    • Find patterns in part numbers - Assign product families
  • Customer feedback
    • 1000 messages/month
    • Keywords in message
    • LDA
    • Cluster analysis on keywords
      • The more characters, the angrier they were
    • Identified distinct categories in messages based on text
    • Found 11 topics
    • Sample comments per category
  • Outcomes of clinical trial
    • Compare treatments of $2000 drug vs $50 drug
    • 2000 patients over a 4-year period
    • Patients checked weekly and no big data analysis
    • Patients were not comparable, but that did not change the final answer
    • Study conclusions were correct
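
A minimal sketch of the "1 or 2 letter difference" matching from the product search project above, assuming plain Levenshtein edit distance is an acceptable measure of similarity. The notes do not say which method was actually used. The part numbers and the names edit_distance and similar_parts are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


def similar_parts(query, catalog, max_distance=2):
    """Return catalog entries within max_distance edits of query, closest first."""
    scored = [(edit_distance(query.upper(), part.upper()), part) for part in catalog]
    return [part for dist, part in sorted(scored) if dist <= max_distance]


# Hypothetical part numbers, not taken from the talk.
catalog = ["AB-1234", "AB-1235", "AB-1324", "CD-9999"]
matches = similar_parts("AB-1234", catalog)
print(matches)
```

Comparing the query against every entry like this would likely be too slow for 10 million+ products, so a real system would probably need an index in front of it. That part is outside this sketch.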

Data Scientist:

  • New field
  • What makes you one?
  • Understand data sets
  • Use skills to answer relevant problems
  • Have side projects:
    • Personal website
    • Kaggle.com
  • Skills:
    • Hacking ( Programming ), Math, Domain knowledge
    • Teamwork - Data scientists work in teams
  • Tools:
    • RStudio
    • Python
    • Apache Spark ( usable from R via SparkR or sparklyr )
    • Functional Programming
      • Vectorizing operations
    • Scala
    • Database vs flat-files
  • Becoming Data Scientist:
    • Data Engineer vs Data Analyst
    • Data Engineer:
      • Programming background
    • Want to do more than make a program
    • Look into data engineering.
    • Big data tools
    • Take online courses in statistics and modeling

Quotes:

  • “Take a step back and think about the process. What is the value you’re trying to deliver?”
  • “Don’t always trust the first answer you get.” ( See data dredging )
  • “Red flags don’t always mean something is wrong.”
  • “It’s okay to be wrong.”

Data Dredging:

  • Can always find a correlation in data if you look enough times
  • Doesn’t mean it’s true though
  • To Prevent:
    • Use 80% of data and compare to other 20%
    • Split the 80% into two groups, training and validation ( e.g. for cross-validation )
  • Compare to test data at end
  • If using test data to steer answer, you’re dredging
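
A rough sketch of the split described above, assuming the second group inside the 80% is a validation set used while tuning. The function name three_way_split, the 25% validation share, and the fixed seed are assumptions for illustration, not from the talk.

```python
import random


def three_way_split(rows, test_fraction=0.2, validation_fraction=0.25, seed=0):
    """Shuffle rows, hold out a test set, then split the rest into training and validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]        # the 20%: only looked at once, at the end
    working = shuffled[n_test:]     # the 80%: free to explore

    n_validation = int(len(working) * validation_fraction)
    validation = working[:n_validation]
    training = working[n_validation:]
    return training, validation, test


rows = list(range(100))  # stand-in for real records
training, validation, test = three_way_split(rows)
print(len(training), len(validation), len(test))
```

The idea is that models get compared on the validation group as often as needed, while the test group is scored a single time. Going back to change the model after seeing the test score is the dredging the notes warn about.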

Cardiovascular Health Data Set:

  • MIMIC-II and MIMIC-III data sets
  • Looked at glucose data

Jeff Leek:

  • Big Data Forgot About Statistics - Textbook
    • Failed big data projects ( IBM, Google, other big names )

Interview Questions:

  • “Best ways to represent data and find anomalies?”
    • Show me how you would think about doing it, as opposed to actually doing it
    • Show thought process
    • Employers looking for someone with ability to learn
    • Ask more questions about the data set
    • Anomalies over time, category, etc…
    • Drill deeper:
      • Lots of variance in data?
      • Scale
    • Simple charts with the least amount of information that still get the point across
  • “Best methods to understand context of data?”
    • Go to source of data to better understand
    • Machine, person, etc…
    • Understand process of how data was generated
    • Quality assessment of data
    • Do missing fields tell you about data?
    • Drill deeper about missing data
  • “What sort of search questions should you look for when data mining?” ( e.g. healthcare IT in economics/patient feedback )
    • Ask how they do it today ( if at all )
    • Understand the process of how data is acquired
    • Do they know other people doing it?
    • Data manipulation and exploration can help form assumptions and hypotheses
  • “How important is the cloud and virtualization in big data?”
    • Depends on companies and domain
    • N.E. Ohio has more local hosts compared to cloud
    • Local servers are expensive
  • “How would you handle conflicting data?” / “What if you don’t have a data set?”
    • Client did not know where they wanted information from
    • Wanted people to take advantage of program and recruit companies
    • Did not know how, only had a data set of local companies
    • Wanted to identify companies that have criteria that shows growth
    • Determine key indicators
      • Job boards, news articles, social media
      • Non-traditional data, make data set
  • “How would a data scientist help further a company?”
    • Where can there be improvements/where can they do better
    • Sales, turnover, manufacturing, etc…
    • What are they doing today?
    • Identify process and understand why
  • Other Stuff
    • Data scientist must communicate well
    • Pain points, assumptions, etc…
    • Translate technical information to executives
  • Predictive/Exploratory Analysis
    • What happened and why?
    • Makes inference on data you have
    • Correlation != Causation
  • Big data potential in respect to privacy
    • Grow up in environment where digital life is not private
    • Has pros/cons
    • Social bias in machine learning
      • Actively have to weed it out
      • Google machine learning algorithms
      • Data should not have human bias
  • Open Data
    • Data is important
    • Comparable to open source code
    • Some cannot be open sourced
  • Known Unknowns in Data
    • Data found in company, but the filters applied and where it came from are not known
    • Data analytics works with complete information, no unknowns
    • Data science has unknowns
    • Data science down the road
    • Sees it becoming as critical as IT services are
      • Companies are starting to take inventory on data
      • Data science becomes more and more routine
        • Become trusted advisors to make better decisions
  • Lots of open data science positions in Cleveland
    • Be a data scientist, today!
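
For the anomaly question in the interview notes above ( anomalies by category, variance, scale ), one way to show the thought process is a small sketch like this. It assumes a value counts as anomalous when it sits more than a chosen number of standard deviations from the mean of its own category. The function name flag_anomalies, the threshold of 2, and the sample readings are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_anomalies(records, threshold=2.0):
    """records: iterable of (category, value) pairs.

    Returns (category, value, z_score) for values far from their category mean.
    """
    by_category = defaultdict(list)
    for category, value in records:
        by_category[category].append(value)

    flagged = []
    for category, values in by_category.items():
        center = mean(values)
        spread = pstdev(values)  # population standard deviation of this category
        if spread == 0:
            continue  # every value identical, nothing stands out
        for value in values:
            z_score = (value - center) / spread
            if abs(z_score) > threshold:
                flagged.append((category, value, round(z_score, 2)))
    return flagged


# Made-up readings: category A is around 10, category B is around 100.
records = [("A", v) for v in [10, 11, 9, 10, 12, 10, 11, 50]]
records += [("B", v) for v in [100, 102, 98, 101, 99, 100, 103, 97]]
flagged = flag_anomalies(records)
print(flagged)
```

Scoring each category against its own mean and spread keeps the two scales from hiding each other: 50 is unusual for A even though it would be a low value for B. A single extreme value also inflates the standard deviation it is judged against, which is one reason to ask more questions about the data before trusting a flag like this.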