Luke's Notes

Intro:
Didn’t complete data entrepreneurial pursuit
There are 200 data professionals in the region
* -good time to be in the field
* -started as a neuroscience undergrad
* -started working on real-world problems in Saudi Arabia
* -in house hospital network
* -Worked on “Population health management”
* -Needed to find ways to monitor their assets
* -trained in MATLAB

R is “today’s data science currency”
Worked at Saudi Aramco
* -Developed dashboard tools to create monthly reports
* -Over 5 years, used them to create 25 mil in value
* -Seeing value in data reports, i.e., capital
* -used applied math
* -used programming
* -wanted to find classes to complement his program

Those classes together made him stumble into data science
Created his own major in data science
6 years undergrad, some part time

Interesting projects today
* -dealing with a local company that ships 10 million products/week
* * +trying to resolve searching and indexing on the web
* -helping get a customer pulse
* -sorting outcomes of clinical trials

Skills
* -hacking skills - programming
* -science - math, analytics
* -domain - discipline

CHUG
-this Monday (3/27/17)
-Spark - Apache tool
“SparkR enables you to leverage what you’re used to using in data science, but on bigger data”

Fuzzy search:
Basically - similar matches vs exact matches
Finds words that differ by one or two letters
Onyx = Ohio’s only Google partner
Part#
Used the beginning of part# to categorize which department the item belonged to
Moral: take a step back. Most of this is what you know. Recognize what is the value you’re trying to bring to the company

Used in-house testing to ensure fuzzy search accuracy
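
A minimal sketch of the "one or two different letters" idea, assuming edit (Levenshtein) distance as the similarity measure. The notes don't say which algorithm was actually used, and the catalog names below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_search(query, candidates, max_distance=2):
    """Return candidates within max_distance edits of query, closest first."""
    q = query.lower()
    scored = [(edit_distance(q, c.lower()), c) for c in candidates]
    return [c for d, c in sorted(scored) if d <= max_distance]


catalog = ["widget", "gadget", "bracket", "gasket"]  # hypothetical product names
matches = fuzzy_search("gaskit", catalog)            # misspelled query
```

Setting max_distance=0 reduces this to an exact match, which makes it easy to compare fuzzy vs exact results during in-house testing.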

Clinical trial for 2 drugs treating blindness
Drug A $50/month, drug B $2000/month
-conclusion - doesn't matter which drug
2000 patients over 2 years, 2 gigs of data, weekly encounters with each patient
-done without data science involvement at first

Were the two treatment groups comparable?
* initial pass of data set revealed not comparable
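
One common way to check whether two groups are comparable at baseline is the standardized mean difference (SMD). This is only a sketch: the ages below are made up, not the trial's data, and the 0.1 threshold is a rule of thumb, not something from the talk.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(group_a) + variance(group_b)) / 2)
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical baseline ages for each treatment arm (not the trial's data).
ages_drug_a = [61, 64, 66, 70, 72, 75]
ages_drug_b = [71, 74, 77, 79, 82, 85]

smd = standardized_mean_difference(ages_drug_a, ages_drug_b)
# Rule of thumb (an assumption here): |SMD| > 0.1 suggests an imbalance worth a closer look.
imbalanced = abs(smd) > 0.1
```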

Two things to remember:
1. don't always trust the first answer in data science
2. red flags don't necessarily mean something is wrong

“Data Dredging” - one of the most dangerous things in data science
Linear regression - fitting trend to data
Significance = derived by luck? Or meaningful
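
A small simulation of why data dredging is dangerous: test enough pure-noise predictors against an outcome and some will look "significant" by luck alone. The sample sizes and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


random.seed(0)
n_samples, n_predictors = 30, 200
# For n = 30, |r| above roughly 0.361 corresponds to p < 0.05 (two-sided).
R_CRITICAL = 0.361

outcome = [random.gauss(0, 1) for _ in range(n_samples)]
hits = []
for i in range(n_predictors):
    noise = [random.gauss(0, 1) for _ in range(n_samples)]  # unrelated to outcome
    if abs(pearson_r(noise, outcome)) > R_CRITICAL:
        hits.append(i)

print(f"{len(hits)} of {n_predictors} pure-noise predictors look 'significant'")
```

At a 5% threshold, roughly 1 in 20 unrelated predictors is expected to pass, so a "finding" pulled from many untargeted tests needs confirmation on fresh data.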

FUN FACT: Glucose does not correlate to age unless diabetic- then higher glucose with age

“If you abuse data enough, it will tell you things”
Sometimes it's okay to be wrong, don't try too hard to steer data set
“You can always lie with data statistics”

2. MIMIC-II … MIMIC-III
-specific cardiovascular data set
Once you’ve selected your data and made assumptions, move forward to experimenting on different models, go back and define parameters, as long as there's an original data set you haven’t touched
Once you’ve compared to test data set, you’ve created bias, don’t go back on that.
Fine tune parameters = complete data set -> split into training set, test data set
Once you compare test data set, can’t go back
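
The splitting discipline above can be sketched as follows. The separate validation partition is an assumption added here (the notes only mention training and test sets); it gives you something to tune against without touching the test set.

```python
import random


def split_dataset(rows, test_fraction=0.2, validation_fraction=0.2, seed=42):
    """Shuffle once, then carve out training, validation, and test partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


rows = list(range(100))  # stand-in for real records
train, validation, test = split_dataset(rows)
# Tune parameters by fitting on `train` and scoring on `validation`, as often as needed.
# Score on `test` exactly once, at the end; after that, no more tuning.
```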

Big data projects often fail because they forget about statistics

Q. What would you say to a software engineer to pursue data science?
* A. -the ideal data scientist = hacking, science, domain skills = unicorn
* -data engineer vs data scientist/analyst
* * Data engineer = programming background, java, hadoop, software background
* * Recommended first step for a software engineer = looking into data engineering
* Take online courses in statistics and modelling
* Exposed to constraints of data science
* Data scientist = ?

EXERCISE TIME
Come up with interview questions

Q. Best way to find anomalies in dataset??
* A. important to employers
* The question isn't “can you do it?”
* It is “can you show me how you think about doing it?”
* Even if you can't do it, you get nice brownie points for the explanation
* Employers want people who have “enough of the right grounding, mindset, and ability to learn”

* A: as a data scientist it's your job to ask more questions
* Over time?
* Quality of data set?
* In relation to categories?
* Naming convention?
* Numeric, categorical, time series
* Does it have a lot of variants? Create a visual for that?
* Choose a simple chart with the least amount of data on it to highlight anomalies
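
For numeric columns, one simple starting point (not something the speaker prescribed) is the interquartile-range rule: flag values far outside the middle 50% of the data. The order counts below are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


daily_orders = [102, 98, 105, 97, 101, 99, 103, 100, 350, 96]  # hypothetical counts
outliers = iqr_outliers(daily_orders)
```

A flagged value is a prompt for more questions (data quality? a real event?), not proof that something is wrong.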

Q. When approaching a new data set, what are best methods to understand the dataset to accurately describe what you’re working with
* A. Depends on how dataset was generated. By machine? Or by a person

Q. Give an example of machine learning
* Used it for customer feedback.
* Org got close to 1k different messages every month
* Couldn't do it with 1 person to look at it all the time
* Categorize messages by keywords/tags
* Applied LDA = Latent Dirichlet allocation = cluster analysis on words
* Natural language processing - LDA
* * -used it to define 11 different categories
* * -machine learning is able to find things that a human may not catch
* * -customer relation with company
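
A toy sketch of LDA using collapsed Gibbs sampling in plain Python, to show the mechanics of assigning words to topics. The messages, topic count, and hyperparameters (alpha, beta) are all assumptions for illustration; the notes don't describe the actual implementation, and a real project would more likely use an established library.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic proportions and the vocabulary."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]       # words in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(n_topics)]  # times word w is assigned to topic k
    topic_total = [0] * n_topics                     # total words assigned to topic k
    assignments = []

    # Random initial topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                # Remove the current assignment before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    proportions = [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return proportions, vocab


# Hypothetical customer messages, already lowercased and tokenized.
messages = [
    "late delivery package shipping".split(),
    "shipping delay package lost".split(),
    "refund billing charge invoice".split(),
    "invoice charge billing error".split(),
]
proportions, vocab = lda_gibbs(messages, n_topics=2)
```

Each row of proportions says how much of a message belongs to each topic; labeling the topics (e.g. "shipping" vs "billing") is still a human judgment.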

Q. What sort of search question would you want to look for when doing data mining for a big data project
* A. Can you be more specific?

Q. what are the general search questions when doing data mining
* A. Example: HIT company we need to focus on economic factor + patient relations
* Goes back to process. A great question to ask is “how do you do it today?” The answer is often “we’re not doing it” or “we’d like to be doing it today.” Then ask “do we know of other people doing it?” Often they’ve seen someone else doing it
* Once you’ve traced it back to source data, then you can begin forming hypotheses and data mining

Q. how important is the cloud and virtualization when it comes to big data
* A. Depends on the company
* Either local, or on the cloud
* In NE Ohio, companies like to be local, especially the larger ones
* In HIT - default is all machines are virtualized
* Virtualization is like the concept of chromebook
* * -you have a machine you are working with, but the computing hardware is not there, i.e., it is plugged into some central resource
* * -Virtualization causes challenges when installing applications
* * Local is expensive

Q. how would you handle conflicting data
* A. Project - client had no idea where they wanted to get info from
* Nonprofit with a grant - fully funded - trying to recruit companies to participate
* Data set with local companies.. How to sort them by interest in their program
* Sort by companies that meet criteria such as growth trends

* “Data scientist knows that there is data out there that can be found”

* You can make up a data set if you don’t have one

Q. how would you as a data scientist help further our success if we gave you all our data including sales in stores / online
* A. It all starts with where does it occur today. All companies are hurting in some area, or they have an area where they’d like to improve
* Compare to what are you doing today.
* Often the area is hurting because it is not being reported on
* A question missing from the list was about working in teams, which is very important in this field
* Data scientist is a very special role where you report to a high-up business officer, like the VP of finance, rather than up the IT ladder

Q. tips for predictive analytics to find new trends in markets

* A. Some questions you should not try to answer
* * Arrow graphic he calls “The three stages of question”

* “Correlation is not causation”
* Just because you can predict something doesn't mean you can answer what's going to change the course

Q. what are your thoughts on big data’s potential with respect to privacy and the internet bubble? How are you classified by ..facebook? Amazon’s alexa? How are things we see on the internet tailored to us as individuals?
* A. At a high level - to avoid ethics - we are growing up in an environment where our visual lives are no longer private. Perks = ads, content tailored to you, things that you love to see.
* Examples of bias and machine learning:
* An article from summer 2016: a lot of social bias exists in machine learning…without weeding that out, machine learning can do a lot of harm
* Alabama was going to use AI to decide rulings on court cases

Q. what other tools should data scientist be familiar with
* So many tools… Python, RapidMiner, Tableau, R. Functional programming is becoming more and more important: applying single operations to multiple data sets.
* Functional languages - R, commands like apply - vectorizing operations. Understanding Scala will get you very far
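
A short Python sketch of "applying single operations to multiple data sets", similar in spirit to R's apply family. The data set names and values are hypothetical.

```python
from statistics import mean


def normalize(values):
    """Rescale a list of numbers to the 0-1 range."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


# Hypothetical data sets that all need the same treatment.
datasets = {
    "store_sales": [120, 150, 90, 180],
    "online_sales": [300, 280, 350, 400],
}

# One operation applied across every data set, with no explicit index loop.
normalized = {name: normalize(values) for name, values in datasets.items()}
averages = dict(zip(datasets, map(mean, datasets.values())))
```

Passing functions like mean and normalize around as values is the core habit that carries over to R, Scala, and Spark.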

Q. how do you know that you have the skills necessary to be a data scientist
* A. You can qualify yourself. You are a pro if you understand how to work with data sets. You understand the basics of data curation. You can apply math and stats to inform data set creators. How do you back it up? By having portfolios of projects on LinkedIn and other sites. Put “Seeking roles as data scientist” as your header

Q. opinion on open data?
* A. Open data is really important. Akin to the open source movement with software. There's always data in your organization that can’t be open sourced. Highly encouraged to use open data to make your work environment better

Q. how do you find known-unknowns within your company
* A. Data analytics is what you do with data …known/unknown
* Data science is you know that you don't know things
* The best way to move forward is to document assumptions, truths, uncertainties..
* Always check this while you work through the data
* Huge overlap in data engineering, data analytics, data science
* data science begins with “the answer that you’re arriving at depends on making assumptions and structuring the data beyond what you’re getting.”

Q. where do you see data science in 10 years
* A. Data science is as critical as IT services

Q. specific strategies vs universal strategies when mining through a data set
* the beginning is universal. Learning how to ask the right questions, understand assumptions, where did the data come from
* project specific is technical stuff. R vs Python. Does this require big data? Does 80 - 20 work? Will I need to do some modeling?