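(Translation) 30 Days of Python 👨💻 Day 27: Machine Learning & Data Science (Part 1)
Now it is time to dive into some real machine learning and data science coding. Today I focused mainly on getting started with the Jupyter Notebook workflow and on creating a basic project to understand how it works. The plan is to find a dataset and then follow the basic principles of machine learning to extract useful information from it. I will also share the notebook I created. What makes Jupyter Notebooks most useful is that they can be organized like a blog post or article, with interactive code, data, and other information side by side.
Working with Jupyter Notebooks
I'd like to point to a few great references for understanding the Jupyter Notebook interface, installation, and workflow.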
- Jupyter Notebook Tutorial Video
- Installation guideline - (installing it through the Anaconda toolkit is recommended because it ships with many useful tools)
- Documentation
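Since I am a Windows user, here is a quick tip:
On Windows, open the Anaconda Prompt from the Start menu, navigate to the directory where you want to create the Jupyter project, and run the command
jupyter notebook
which opens the notebook in your browser.
Following the basic steps of machine learning and data science, we will create a project and a readable notebook that documents the whole process and can be shared with anyone.
A basic data science and machine learning project using Netflix data
Note: Netflix is an online film rental and streaming provider
The basic steps of machine learning and data science:
- Import data from a source
- Clean the data and, if needed, remove anything irrelevant
- Split the data into training data and test data
- Create a model, algorithm, or function
- Check the output
- Improve and repeat the steps above
In this basic project we will explore the first two steps.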
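The steps above can be sketched end to end on a tiny table. This is only an illustration under stated assumptions: the numbers are a hypothetical toy sample rather than the real dataset, and the "model" is a deliberately trivial baseline that always predicts the training mean.

```python
import pandas as pd

# Step 1: hypothetical toy data standing in for an imported dataset.
raw = pd.DataFrame({
    "duration_min": [90, 94, None, 99, 110, 60, 90, 125],
    "release_year": [2019, 2016, 2013, 2017, 2014, 2017, 2014, 2019],
})

# Step 2: clean the data by dropping rows with missing values.
clean = raw.dropna().reset_index(drop=True)

# Step 3: split into training and test data (roughly 70% / 30%).
train = clean.sample(frac=0.7, random_state=0)
test = clean.drop(train.index)

# Step 4: a trivial baseline "model" that always predicts the training mean.
predicted_duration = train["duration_min"].mean()

# Step 5: check the output using the mean absolute error on the test data.
mean_abs_error = (test["duration_min"] - predicted_duration).abs().mean()
print(f"prediction={predicted_duration:.1f}, error={mean_abs_error:.1f}")
```

Step 6 would mean changing the cleaning or the model and running the cells again, which is exactly the loop a notebook makes convenient.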
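1. Importing and manipulating data
The first and most important thing in machine learning and data science is the data itself. To reach meaningful conclusions, we need a good dataset. This input data can be collected in many ways: from a database, by scraping the web, through a public API, or from a shared dataset.
Kaggle is a website popular with machine learning and data science enthusiasts, and it hosts a large number of publicly shared datasets.
I decided to look for a dataset of Netflix shows and found this one on Kaggle - www.kaggle.com/shivamb/net… . It contains the data in CSV format that will be used for this project. After downloading the file, place it in the root directory of the project. I named it netflix_titles.csv.
Because the data is tabular, that is, arranged in rows and columns, pandas is a good open-source library for processing and analyzing it. It ships with the Anaconda toolkit, so it can be used directly in the notebook.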
import pandas as pd
data_frame = pd.read_csv('netflix_titles.csv')
data_frame.head(10) # show first 10 results
# renders the data frame as a table
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra... |
| 1 | 80117401 | Movie | Jandino: Whatever it Takes | NaN | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra... |
| 2 | 70234439 | TV Show | Transformers Prime | NaN | Peter Cullen, Sumalee Montano, Frank Welker, J... | United States | September 8, 2018 | 2013 | TV-Y7-FV | 1 Season | Kids' TV | With the help of three human allies, the Autob... |
| 3 | 80058654 | TV Show | Transformers: Robots in Disguise | NaN | Will Friedle, Darren Criss, Constance Zimmer, ... | United States | September 8, 2018 | 2016 | TV-Y7 | 1 Season | Kids' TV | When a prison ship crash unleashes hundreds of... |
| 4 | 80125979 | Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins... | United States | September 8, 2017 | 2017 | TV-14 | 99 min | Comedies | When nerdy high schooler Dani finally attracts... |
| 5 | 80163890 | TV Show | Apaches | NaN | Alberto Ammann, Eloy Azorín, Verónica Echegui,... | Spain | September 8, 2017 | 2016 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, Spanis... | A young journalist is forced into a life of cr... |
| 6 | 70304989 | Movie | Automata | Gabe Ibáñez | Antonio Banderas, Dylan McDermott, Melanie Gri... | Bulgaria, United States, Spain, Canada | September 8, 2017 | 2014 | R | 110 min | International Movies, Sci-Fi & Fantasy, Thrillers | In a dystopian future, an insurance adjuster f... |
| 7 | 80164077 | Movie | Fabrizio Copano: Solo pienso en mi | Rodrigo Toro, Francisco Schultz | Fabrizio Copano | Chile | September 8, 2017 | 2017 | TV-MA | 60 min | Stand-Up Comedy | Fabrizio Copano takes audience participation t... |
| 8 | 80117902 | TV Show | Fire Chasers | NaN | NaN | United States | September 8, 2017 | 2017 | TV-MA | 1 Season | Docuseries, Science & Nature TV | As California's 2016 fire season rages, brave ... |
| 9 | 70304990 | Movie | Good People | Henrik Ruben Genz | James Franco, Kate Hudson, Tom Wilkinson, Omar... | United States, United Kingdom, Denmark, Sweden | September 8, 2017 | 2014 | R | 90 min | Action & Adventure, Thrillers | A struggling couple can't believe their luck w... |
data_frame.info()
# shows information about column data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 6234 non-null int64
1 type 6234 non-null object
2 title 6234 non-null object
3 director 4265 non-null object
4 cast 5664 non-null object
5 country 5758 non-null object
6 date_added 6223 non-null object
7 release_year 6234 non-null int64
8 rating 6224 non-null object
9 duration 6234 non-null object
10 listed_in 6234 non-null object
11 description 6234 non-null object
dtypes: int64(2), object(10)
memory usage: 584.6+ KB
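The Non-Null Count column above already hints at gaps: director has 4265 non-null values out of 6234 rows, so 1969 entries are missing. A minimal sketch of counting missing values per column directly, using a hypothetical three-row sample instead of the full file:

```python
import pandas as pd

# Hypothetical sample with the same kind of gaps as the real data.
sample = pd.DataFrame({
    "title": ["Jandino: Whatever it Takes", "Fire Chasers", "Good People"],
    "director": [None, None, "Henrik Ruben Genz"],
    "cast": ["Jandino Asporaat", None, "James Franco, Kate Hudson"],
})

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the real data frame the same call would be data_frame.isna().sum().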
data_frame.shape
# returns the number of rows and columns as a tuple
(6234, 12)
data_frame.describe()
# shows summary statistics for the numeric columns
| | show_id | release_year |
|---|---|---|
| count | 6.234000e+03 | 6234.00000 |
| mean | 7.670368e+07 | 2013.35932 |
| std | 1.094296e+07 | 8.81162 |
| min | 2.477470e+05 | 1925.00000 |
| 25% | 8.003580e+07 | 2013.00000 |
| 50% | 8.016337e+07 | 2016.00000 |
| 75% | 8.024489e+07 | 2018.00000 |
| max | 8.123573e+07 | 2020.00000 |
data_frame['title'].head() # lists a specific column data with first 5 entries (head)
0 Norm of the North: King Sized Adventure
1 Jandino: Whatever it Takes
2 Transformers Prime
3 Transformers: Robots in Disguise
4 #realityhigh
Name: title, dtype: object
# Filtering Data
data_frame[data_frame['country'] == 'India'].head()
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35 | 81154455 | Movie | Article 15 | Anubhav Sinha | Ayushmann Khurrana, Nassar, Manoj Pahwa, Kumud... | India | September 6, 2019 | 2019 | TV-MA | 125 min | Dramas, International Movies, Thrillers | The grim realities of caste discrimination com... |
| 37 | 81052275 | Movie | Ee Nagaraniki Emaindi | Tharun Bhascker | Vishwaksen Naidu, Sushanth Reddy, Abhinav Goma... | India | September 6, 2019 | 2018 | TV-14 | 133 min | Comedies, International Movies | In Goa and in desperate need of cash, four chi... |
| 41 | 70303496 | Movie | PK | Rajkumar Hirani | Aamir Khan, Anuskha Sharma, Sanjay Dutt, Saura... | India | September 6, 2018 | 2014 | TV-14 | 146 min | Comedies, Dramas, International Movies | Aamir Khan teams with director Rajkumar Hirani... |
| 58 | 81155784 | Movie | Watchman | A. L. Vijay | G.V. Prakash Kumar, Samyuktha Hegde, Suman, Ra... | India | September 4, 2019 | 2019 | TV-14 | 93 min | Comedies, Dramas, International Movies | Rushing to pay off a loan shark, a young man b... |
| 99 | 80225885 | TV Show | Bard of Blood | NaN | Emraan Hashmi, Viineet Kumar, Sobhita Dhulipal... | India | September 27, 2019 | 2019 | TV-MA | 1 Season | International TV Shows, TV Action & Adventure,... | Years after a disastrous job in Balochistan, a... |
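Note that the equality filter keeps only rows whose country is exactly 'India'. A co-production such as row 0 above ("United States, India, South Korea, China") is left out. If those rows should count too, one option is a substring match, sketched here on a small hypothetical sample; na=False treats a missing country as a non-match.

```python
import pandas as pd

# Hypothetical sample: one co-production, one exact match, one other, one missing.
sample = pd.DataFrame({
    "title": ["Norm of the North: King Sized Adventure", "PK",
              "Transformers Prime", "The Healing Powers of Dude"],
    "country": ["United States, India, South Korea, China", "India",
                "United States", None],
})

# Exact match: only rows where the whole cell equals "India".
exact = sample[sample["country"] == "India"]

# Substring match: any row whose country list mentions India.
partial = sample[sample["country"].str.contains("India", na=False)]

print(len(exact), len(partial))
```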
# Sorting Data
data_frame.sort_values('release_year', ascending=False).head()
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3467 | 81011449 | TV Show | Medical Police | NaN | Erinn Hayes, Rob Huebel, Malin Akerman, Rob Co... | United States | January 10, 2020 | 2020 | TV-MA | 1 Season | Crime TV Shows, TV Action & Adventure, TV Come... | Doctors Owen Maestro and Lola Spratt leave Chi... |
| 3249 | 81006825 | Movie | All the Freckles in the World | Yibrán Asuad | Hánssel Casillas, Loreto Peralta, Andrea Sutto... | Mexico | January 3, 2020 | 2020 | TV-14 | 90 min | Comedies, International Movies, Romantic Movies | Thirteen-year-old José Miguel is immune to 199... |
| 3220 | 80997687 | TV Show | Dracula | NaN | Claes Bang, Dolly Wells, John Heffernan | United Kingdom | January 4, 2020 | 2020 | TV-14 | 1 Season | British TV Shows, International TV Shows, TV D... | The Count Dracula legend transforms with new t... |
| 3427 | 81060049 | Movie | Leslie Jones: Time Machine | David Benioff, D.B. Weiss | Leslie Jones | United States | January 14, 2020 | 2020 | TV-MA | 66 min | Stand-Up Comedy | From trying to seduce Prince to battling sleep... |
| 3436 | 80239306 | TV Show | The Healing Powers of Dude | NaN | Jace Chapman, Larisa Oleynik, Tom Everett Scot... | NaN | January 13, 2020 | 2020 | TV-G | 1 Season | Kids' TV, TV Comedies, TV Dramas | When an 11-year-old boy with social anxiety di... |
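Sorting by release_year works directly because it is an integer column. date_added, however, is stored as text (dtype object in the info() output), so sorting it would be alphabetical rather than chronological. A possible approach, sketched on a hypothetical sample and assuming the "Month day, year" format seen in the table, is to parse it first:

```python
import pandas as pd

# Hypothetical sample using date strings in the format shown above.
sample = pd.DataFrame({
    "title": ["Dracula", "Bard of Blood", "Medical Police"],
    "date_added": ["January 4, 2020", "September 27, 2019", "January 10, 2020"],
})

# Parse the text into real dates; values that do not match become NaT.
sample["date_added_parsed"] = pd.to_datetime(
    sample["date_added"], format="%B %d, %Y", errors="coerce"
)

# Sort chronologically, newest first.
newest_first = sample.sort_values("date_added_parsed", ascending=False)
print(newest_first[["title", "date_added_parsed"]])
```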
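This is a great Python data science cheat sheet that lists all the commonly used pandas methods and attributes, along with other libraries used for data science.
2. Cleaning the data
The next step is to clean the data and remove any information that is not needed for the analysis.
Let's consider a sample use case: we want to find Netflix comedy movies and shows that are suitable for all ages (rated TV-G).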
# Let's select the relevant columns for analysis
df_shows = pd.DataFrame(data_frame, columns=['title','rating', 'listed_in'])
# filter comedy shows
df_comedy_shows = df_shows[df_shows['listed_in'].str.contains('Comed')]
df_comedy_shows.head()
| | title | rating | listed_in |
|---|---|---|---|
| 0 | Norm of the North: King Sized Adventure | TV-PG | Children & Family Movies, Comedies |
| 1 | Jandino: Whatever it Takes | TV-MA | Stand-Up Comedy |
| 4 | #realityhigh | TV-14 | Comedies |
| 7 | Fabrizio Copano: Solo pienso en mi | TV-MA | Stand-Up Comedy |
| 10 | Joaquín Reyes: Una y no más | TV-MA | Stand-Up Comedy |
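The filter above covers only the comedy half of the use case; the rows shown still include TV-MA and TV-14 titles. One way to add the rating condition is to combine two boolean masks with &. The sketch below rebuilds a small hypothetical df_shows from rows that appear in this post so it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for df_shows, built from rows shown earlier.
df_shows = pd.DataFrame({
    "title": ["Norm of the North: King Sized Adventure", "#realityhigh",
              "The Healing Powers of Dude", "Fire Chasers"],
    "rating": ["TV-PG", "TV-14", "TV-G", "TV-MA"],
    "listed_in": ["Children & Family Movies, Comedies", "Comedies",
                  "Kids' TV, TV Comedies, TV Dramas",
                  "Docuseries, Science & Nature TV"],
})

# One mask per condition; na=False treats missing genres as non-matches.
is_comedy = df_shows["listed_in"].str.contains("Comed", na=False)
is_tv_g = df_shows["rating"] == "TV-G"

# Keep only rows that satisfy both conditions.
df_family_comedies = df_shows[is_comedy & is_tv_g]
print(df_family_comedies)
```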
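The GitHub repository for this notebook can be found here
Resources
That's all for today. Tomorrow we will continue exploring the remaining steps of machine learning and data science, analyzing the data visually by building graphs and charts and creating a machine learning model.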