Becoming a data scientist

2019-03-27 00:48 | Source: Internet

Data Week: Becoming a data scientist

Data Pointed, CouchDB in the Cloud, Launching Strata


How do I become a data scientist?

Background: I recently finished my bachelor's degree in computer science at Berkeley. Although it may be a bit late, I am just now getting interested in learning more about statistics and "data science." Unfortunately, I don't have much of a math background (I only took up to Linear Algebra and the required probability/discrete math course for CS). Although I have started working, I have the option of enrolling in an MS CS program in January. What courses should I be looking at, and would an MS in Statistics be more useful? If so, is it possible to get into an MS in Statistics program without a strong math background? I will probably be looking into taking machine learning and data visualization.

Strictly speaking, there is no such thing as "data science" (see What is data science?). See also Vardi, "Science has only two legs".
Here are some resources I've collected about working with data; I hope you find them useful (note: I'm an undergrad student, so this is not an expert opinion in any way).

1) Learn about matrix factorizations:

Take the Computational Linear Algebra course (it is sometimes called Applied Linear Algebra, Matrix Computations, Numerical Analysis, or Matrix Analysis, and it can be either a CS or an Applied Math course). Matrix decomposition algorithms are fundamental to many data mining applications and are usually underrepresented in a standard "machine learning" curriculum. With terabytes of data, traditional tools such as Matlab are no longer suitable for the job; you cannot just run eig() on Big Data. Distributed matrix computation packages such as those included in Apache Mahout [1] are trying to fill this void, but you need to understand how the numeric algorithms and LAPACK/BLAS routines [2][3][4][5] work in order to use them properly, adjust for special cases, build your own, and scale them up to terabytes of data on a cluster of commodity machines.[6] Numerics courses are usually built upon undergraduate algebra and calculus, so you should be fine with the prerequisites. I'd recommend these resources for self-study/reference material:
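
As a small illustration of the numerics mentioned above (a sketch in plain Python, not one of the recommended resources), here is power iteration, one of the simplest ways to estimate the dominant eigenvalue of a symmetric matrix without calling eig(). The function names `mat_vec` and `power_iteration` and the 2x2 example matrix are assumptions made up for this example. The only operation that touches the matrix is a matrix-vector product, which is the kind of step that can be split across machines.

```python
import math
import random


def mat_vec(matrix, vector):
    """Multiply a dense matrix (a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, iterations=200, seed=0):
    """Estimate the dominant eigenvalue/eigenvector of a symmetric matrix."""
    rng = random.Random(seed)
    # Random positive start vector (hypothetical choice for this sketch).
    vector = [rng.random() + 0.1 for _ in matrix]
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # Rayleigh quotient v^T A v (v has unit length) estimates the eigenvalue.
    eigenvalue = sum(v * w for v, w in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector


# Toy matrix with eigenvalues 3 and 1.
example = [[2.0, 1.0], [1.0, 2.0]]
eigenvalue, eigenvector = power_iteration(example)
print(eigenvalue, eigenvector)
```

Production code would rely on LAPACK-backed routines or a distributed package instead; the point of the sketch is only to show what such a routine does internally.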

2) Start learning statistics by coding with R:
  • Pick up some R manuals (see What are essential references for R?) and experiment with some data sets, for example from the UCI Machine Learning Repository.
  • Here is a good reference to get started with regression analysis: Gelman, Data Analysis Using Regression and Multilevel/Hierarchical Models.
  • Albert, Bayesian Computation with R.
  • Spector, Data Manipulation with R.
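
The regression references above cover the theory properly. As a rough sketch of the basic idea (written in Python rather than R, with made-up toy numbers and a hypothetical helper name `fit_line`), fitting a straight line by ordinary least squares can look like this:

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

In R the same fit is a one-liner with lm(); writing it out once by hand makes it easier to see what the fitted coefficients mean.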

3) Learn about distributed systems and databases:
  • Note: this topic is not part of a standard Machine Learning track, but you can probably find courses such as Distributed Systems or Parallel Programming in your CS/EE catalog. I believe it is important to learn how to work with a Linux cluster and how to design scalable distributed algorithms if you want to work with big data. It is also becoming increasingly important to be able to utilize the full power of multicore.
  • Download Hadoop [8] and run some MapReduce jobs on your laptop in pseudo-distributed mode (see What's the best way to come up to speed on MapReduce, Hadoop, and Hive?).
  • Learn about the Google technology stack (MapReduce, BigTable, Dremel, Pregel, GFS, Chubby, Protobuf, etc.). See What are the most interesting Google Research papers?
  • Set up an account with Amazon AWS/EC2/S3/EBS and experiment with running Hadoop on a cluster with large data sets (you can use Cloudera or YDN images, but in my opinion you can better understand the system if you set it up from scratch, using the original distribution). Watch the costs.
  • See also: What are some promising open-source alternatives to Hadoop MapReduce for map/reduce?
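
Before installing Hadoop, it can help to see the shape of a MapReduce job on a single machine. The following is a minimal single-process simulation in Python, not Hadoop's actual API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are assumptions made up for this sketch.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["big data is big", "data beats opinion"])
print(counts)
```

On a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, which is where most of the engineering difficulty lies.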

4) Learn about data compression

To be added

5) Learn about machine learning

What are the hottest startups in the analytics space?  
Who are the best VCs in the field of analytics / data mining / databases?
Which companies have the best data science teams? 
What are the notable startups in the news space?
Does the US Census have a data team?  
Why do so many data geeks join web companies instead of solving large scale data problems in biology?  

6)  Learn about least-squares estimation and Kalman filters:

  • This is a classic topic and "data science" par excellence, in my opinion. It is also a good introduction to optimization and control theory. Start with Bierman's LLS tutorial given to his colleagues at JPL; it is clearly written and inspiring (the Apollo mission trajectory was estimated using these methods). Also see Curkendall & Leondes and Quarles.
  • See Steven Kay's series on statistical signal estimation; also check out his short course outline at the University of Rhode Island for a list of interesting topics to learn (this is usually part of EE curricula).
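
To give a feel for the topic, here is a minimal sketch of a one-dimensional Kalman filter in Python. It assumes a constant (or slowly drifting) scalar state observed with noise; the function name `kalman_1d`, its parameters, and the sensor readings are invented for this example and are not taken from the references above. With zero process noise and a very uncertain starting guess, the estimate is approximately the running mean of the measurements, which is the recursive least-squares solution for a constant.

```python
def kalman_1d(measurements, meas_var, process_var=0.0,
              init_estimate=0.0, init_var=1e6):
    """Track a scalar state from noisy measurements."""
    estimate, variance = init_estimate, init_var
    history = []
    for z in measurements:
        # Predict: the state model is "stay the same", so only the
        # uncertainty grows, by the process noise variance.
        variance += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = variance / (variance + meas_var)
        estimate += gain * (z - estimate)
        variance *= 1.0 - gain
        history.append(estimate)
    return estimate, variance, history


# Made-up noisy readings of a quantity whose true value is near 10.
readings = [10.2, 9.7, 10.1, 9.9, 10.3, 9.8]
estimate, variance, history = kalman_1d(readings, meas_var=0.25)
print(estimate, variance)
```

Setting `process_var` above zero makes the filter weight recent measurements more heavily, which is the usual choice when the underlying quantity drifts over time.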

7) Check out these Q&A:

What are the best blogs about data?  
What are the best Twitter accounts about data? 
What are the best blogs about bioinformatics? 
What are the best Twitter accounts about bioinformatics? 
What is data science?
What are the best courses at MIT?  
What are the best resources to learn about web crawling and scraping? 
What are the best interview questions to evaluate a machine learning researcher? 
What are the best resources for learning about distributed file systems?
What are some useful packages for working with large datasets in R? 
What are some good books on stringology and pattern matching?  
What's a good introductory machine learning text? 
What is the best book to pick up working knowledge of theoretical statistics (assuming strong general math)? 
Can anyone recommend a fantastic book on time series analysis? 
What are the standard texts on linear regression? 
What are some good books on random processes? 
How has BigTable evolved since the 2006 Google paper? 
What is a good source for learning about Bayesian networks? 
What are the best data visualizations ever created? 
What are some of the prediction and risk estimation models used by insurance companies? 
How do scientists share data? 
What are the best quant hedge funds? 
What are the best books on econometrics? 
What are the best introductory books on mathematical finance? 
What is the best approach for text categorization?
What are the numbers that every engineer should know, according to Jeff Dean?

If you do decide to go for a Masters degree:

8) Study Engineering - I'd go for CS with a focus on either IR or Machine Learning, or a combination of both, and take some systems courses along the way. As a "data scientist" you will have to write a ton of code and probably develop distributed algorithms/systems to process massive amounts of data. An MS in Statistics will teach you how to do modeling, regression analysis, etc., not how to build systems. I think the latter is more urgently needed these days, as the old tools become obsolete with the avalanche of data. There is a shortage of engineers who can build a data mining system from the ground up. You can pick up statistics from books and experiments with R (see item 2 above) or take some statistics classes as part of your CS studies.

Good luck.

Alex Kamil
Peter Skomoroch, Sr. Data Scientist @ LinkedIn - ... 19 endorsements
If you have the time to take courses, give it a shot.

1) Try to take some of the undergrad math courses you missed. Linear Algebra, Advanced Calculus, Diff. Eq., Probability, Statistics are the most important.  After that, take some Machine Learning courses.  Read a few of the leading ML textbooks and keep up with journals to get a good sense of the field.

2) Read up on what the top data companies are doing.  After 1 or 2 machine learning courses you should have enough background to follow most of the academic papers.  Implement some of these algorithms on real data.

3) If you are working with large datasets, get familiar with the latest techniques & tools (Hadoop, NoSQL, R, etc.) by putting them into practice at work (or outside of work).

Read these posts by Mike Driscoll:

Peter Skomoroch
I am currently working as a data engineer with a team of others and I can tell you what we all have in common:

1) MS or PhD degrees in Applied Mathematics or Electrical Engineering
2) Fluency in C++/Matlab/Python
3) Experience building distributed systems and algorithms

I agree with Anon that CS is probably not the way to go unless you are going to MIT, Caltech, Stanford, CMU, etc. The way I ended up in the field was working as a software engineer designing real-time systems and getting an MS in Applied Math part-time. After 4 years I had skills from both fields and was offered a position doing ML/DM. With that said, I can tell you that it's an extremely interesting field, and it appears the skill set will only become more desirable in the future.
Joseph Misiti
A good start for becoming a data scientist is to get an MS (or PhD) in Machine Learning / Data Mining; along the way you will get plenty of experience in the relevant math and use the latest systems. Stanford, UCI, CMU, and MIT are top schools, but there are many others in the USA and in Europe.

Stanford has online courses in data mining / ML.
Gregory Piatetsky
Russell Jurney, Data Viznik, Hack Historian 2 endorsements
The school route is well covered.  This is the autodidactic route:

Look at some common problems solved with machine learning.  Look at problems in your areas of interest with an abundance of available data. Intersect these sets, pick a problem to solve with ML. Learn whatever it takes to solve it poorly.  Get people using the output of your model. Iterate, learn more techniques.  Work on your maths as needed.  Find mentors to talk with about problems you're working on.  Keep them updated, collaborate, learn from them.

Get good at building things with data.  Update your LinkedIn profile - congratulations, you're a data scientist!
Russell Jurney
Paco Nathan, 45 years ago I couldn't even spe... 4 endorsements
Stanford has an interdisciplinary degree specifically for data science, called Mathematical and Computational Sciences (MCS). It's sponsored by the Stats department and overlaps with CS, Math, Operations Research, etc.  The BS degree dovetails particularly well with a co-term program to get an MS in Computer Science -- say, with a distributed systems specialization.

+1 to both Pete's and Russ' wise words above.
Paco Nathan
Yaniv Goldenrand, Fraud and credit modeling
Get a job doing it; this way you'll learn what really matters and get paid in the process.
The standard way to become a data analyst is a master's in math/statistics plus an internship.
Other ways are:
- PhD in some empirical subject (economics, psychology).
- Get an engineering position in some data-intensive company and convert.
Some of the best modelers I know are ex-programmers.
Yaniv Goldenrand
Reading data mining related blogs is also important to understand the wide application areas of data mining. You have a list of data mining blogs here:
Sandro Saitta
1) Infrastructure for data processing, such as Hadoop/MapReduce, Pig/Hive, and automation/cron.
2) Simple statistics about data, such as mean, correlation, and p-value.
3) Algorithms for data modeling, such as logistic regression and SVM.
4) Visualization of data, such as charts and tables.
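
As a small example of the "simple statistics" in item 2, here is a sketch in plain Python that computes a mean and a Pearson correlation by hand. The helper names `mean` and `pearson_correlation` and the page-view and signup figures are made up for illustration.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented daily figures; signups are exactly one tenth of page views,
# so the two series are perfectly correlated.
page_views = [120, 150, 170, 200, 260]
signups = [12, 15, 17, 20, 26]
r = pearson_correlation(page_views, signups)
print(mean(page_views), r)
```

Real data is never this tidy, and a high correlation on its own says nothing about which variable drives the other.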




