(Translation) 30 Days of Python 👨‍💻 Day 27: Machine Learning & Data Science (Part 1)


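It's time to dig into some real machine learning and data science coding. Today I focused mainly on getting started with the Jupyter Notebook workflow and building a basic project to understand how it works. The goal is to find a dataset and then follow the basic principles of machine learning to extract useful information from it. I will also share the notebook I created. What makes Jupyter Notebooks most useful is that they organize interactive code, data, and other information together, much like a blog post or an article.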

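Working with Jupyter Notebooks

I'd like to point to some great references for understanding the Jupyter Notebook interface, installation, and workflow.

Since I'm a Windows user, here is a quick tip:

On Windows, open the Anaconda Prompt from the Start menu, navigate to the directory where you want to create your Jupyter project, and run the jupyter notebook command. The notebook will then open in your browser.

Following the basic steps of machine learning and data science, we will create a project and record the whole process in a readable notebook that can be shared with anyone.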

A basic data science and machine learning project with Netflix shows

Note: Netflix is an online video rental and streaming service.

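The basic steps of machine learning and data science:

  • Import the data from a source
  • Clean the data, removing any irrelevant data if needed
  • Split the data into training data and test data
  • Create a model, algorithm, or function
  • Check the output
  • Improve and repeat the steps above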
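
As a rough illustration of the third step, here is a minimal sketch of splitting a table into training and test sets using only pandas. The toy DataFrame and its column names (feature, label) are made up for this example, and the 80/20 ratio is just a common convention, not something this project prescribes.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({
    "feature": range(10),
    "label": [0, 1] * 5,
})

# Randomly pick 80% of the rows for training;
# random_state makes the split repeatable
train_df = df.sample(frac=0.8, random_state=42)

# Every row that was not picked for training goes to the test set
test_df = df.drop(train_df.index)

print(len(train_df), len(test_df))  # 8 2
```

Libraries such as scikit-learn offer dedicated helpers for this, but the idea is the same: the model never sees the test rows while it is being built.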

In this basic project we will explore the first two steps.

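1. Importing and manipulating data

The first and most important thing in machine learning and data science is the data itself. To draw meaningful conclusions, we need a good dataset. Input data can be collected in many ways: from databases, by scraping the web, through public APIs, or from shared datasets.

Kaggle is a website that is very popular with machine learning and data science enthusiasts, and plenty of publicly shared datasets can be found there.

I decided to search for a dataset of Netflix shows and found this one on Kaggle - www.kaggle.com/shivamb/net… It contains the data, in CSV format, that will be used for this project. After downloading the file, place it in the root directory of the project. I named it netflix_titles.csv

Since the data is tabular, that is, arranged in rows and columns, pandas is a great open-source library for processing and analyzing it. It ships with the Anaconda distribution, so it can be used directly in the notebook.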

import pandas as pd
data_frame = pd.read_csv('netflix_titles.csv')
data_frame.head(10) # show the first 10 rows
# the notebook displays the data frame as a table
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra... |
| 1 | 80117401 | Movie | Jandino: Whatever it Takes | NaN | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra... |
| 2 | 70234439 | TV Show | Transformers Prime | NaN | Peter Cullen, Sumalee Montano, Frank Welker, J... | United States | September 8, 2018 | 2013 | TV-Y7-FV | 1 Season | Kids' TV | With the help of three human allies, the Autob... |
| 3 | 80058654 | TV Show | Transformers: Robots in Disguise | NaN | Will Friedle, Darren Criss, Constance Zimmer, ... | United States | September 8, 2018 | 2016 | TV-Y7 | 1 Season | Kids' TV | When a prison ship crash unleashes hundreds of... |
| 4 | 80125979 | Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins... | United States | September 8, 2017 | 2017 | TV-14 | 99 min | Comedies | When nerdy high schooler Dani finally attracts... |
| 5 | 80163890 | TV Show | Apaches | NaN | Alberto Ammann, Eloy Azorín, Verónica Echegui,... | Spain | September 8, 2017 | 2016 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, Spanis... | A young journalist is forced into a life of cr... |
| 6 | 70304989 | Movie | Automata | Gabe Ibáñez | Antonio Banderas, Dylan McDermott, Melanie Gri... | Bulgaria, United States, Spain, Canada | September 8, 2017 | 2014 | R | 110 min | International Movies, Sci-Fi & Fantasy, Thrillers | In a dystopian future, an insurance adjuster f... |
| 7 | 80164077 | Movie | Fabrizio Copano: Solo pienso en mi | Rodrigo Toro, Francisco Schultz | Fabrizio Copano | Chile | September 8, 2017 | 2017 | TV-MA | 60 min | Stand-Up Comedy | Fabrizio Copano takes audience participation t... |
| 8 | 80117902 | TV Show | Fire Chasers | NaN | NaN | United States | September 8, 2017 | 2017 | TV-MA | 1 Season | Docuseries, Science & Nature TV | As California's 2016 fire season rages, brave ... |
| 9 | 70304990 | Movie | Good People | Henrik Ruben Genz | James Franco, Kate Hudson, Tom Wilkinson, Omar... | United States, United Kingdom, Denmark, Sweden | September 8, 2017 | 2014 | R | 90 min | Action & Adventure, Thrillers | A struggling couple can't believe their luck w... |
data_frame.info()
# shows information about column data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6234 non-null   int64 
 1   type          6234 non-null   object
 2   title         6234 non-null   object
 3   director      4265 non-null   object
 4   cast          5664 non-null   object
 5   country       5758 non-null   object
 6   date_added    6223 non-null   object
 7   release_year  6234 non-null   int64 
 8   rating        6224 non-null   object
 9   duration      6234 non-null   object
 10  listed_in     6234 non-null   object
 11  description   6234 non-null   object
dtypes: int64(2), object(10)
memory usage: 584.6+ KB
data_frame.shape
# returns the number of rows and columns as a tuple
(6234, 12)
data_frame.describe()
# shows some basic description
| | show_id | release_year |
|---|---|---|
| count | 6.234000e+03 | 6234.00000 |
| mean | 7.670368e+07 | 2013.35932 |
| std | 1.094296e+07 | 8.81162 |
| min | 2.477470e+05 | 1925.00000 |
| 25% | 8.003580e+07 | 2013.00000 |
| 50% | 8.016337e+07 | 2016.00000 |
| 75% | 8.024489e+07 | 2018.00000 |
| max | 8.123573e+07 | 2020.00000 |
data_frame['title'].head() # lists a specific column data with first 5 entries (head)
0    Norm of the North: King Sized Adventure
1                 Jandino: Whatever it Takes
2                         Transformers Prime
3           Transformers: Robots in Disguise
4                               #realityhigh
Name: title, dtype: object
# Filtering Data
data_frame[data_frame['country'] == 'India'].head()
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35 | 81154455 | Movie | Article 15 | Anubhav Sinha | Ayushmann Khurrana, Nassar, Manoj Pahwa, Kumud... | India | September 6, 2019 | 2019 | TV-MA | 125 min | Dramas, International Movies, Thrillers | The grim realities of caste discrimination com... |
| 37 | 81052275 | Movie | Ee Nagaraniki Emaindi | Tharun Bhascker | Vishwaksen Naidu, Sushanth Reddy, Abhinav Goma... | India | September 6, 2019 | 2018 | TV-14 | 133 min | Comedies, International Movies | In Goa and in desperate need of cash, four chi... |
| 41 | 70303496 | Movie | PK | Rajkumar Hirani | Aamir Khan, Anuskha Sharma, Sanjay Dutt, Saura... | India | September 6, 2018 | 2014 | TV-14 | 146 min | Comedies, Dramas, International Movies | Aamir Khan teams with director Rajkumar Hirani... |
| 58 | 81155784 | Movie | Watchman | A. L. Vijay | G.V. Prakash Kumar, Samyuktha Hegde, Suman, Ra... | India | September 4, 2019 | 2019 | TV-14 | 93 min | Comedies, Dramas, International Movies | Rushing to pay off a loan shark, a young man b... |
| 99 | 80225885 | TV Show | Bard of Blood | NaN | Emraan Hashmi, Viineet Kumar, Sobhita Dhulipal... | India | September 27, 2019 | 2019 | TV-MA | 1 Season | International TV Shows, TV Action & Adventure,... | Years after a disastrous job in Balochistan, a... |
# Sorting Data
data_frame.sort_values('release_year', ascending=False).head()
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3467 | 81011449 | TV Show | Medical Police | NaN | Erinn Hayes, Rob Huebel, Malin Akerman, Rob Co... | United States | January 10, 2020 | 2020 | TV-MA | 1 Season | Crime TV Shows, TV Action & Adventure, TV Come... | Doctors Owen Maestro and Lola Spratt leave Chi... |
| 3249 | 81006825 | Movie | All the Freckles in the World | Yibrán Asuad | Hánssel Casillas, Loreto Peralta, Andrea Sutto... | Mexico | January 3, 2020 | 2020 | TV-14 | 90 min | Comedies, International Movies, Romantic Movies | Thirteen-year-old José Miguel is immune to 199... |
| 3220 | 80997687 | TV Show | Dracula | NaN | Claes Bang, Dolly Wells, John Heffernan | United Kingdom | January 4, 2020 | 2020 | TV-14 | 1 Season | British TV Shows, International TV Shows, TV D... | The Count Dracula legend transforms with new t... |
| 3427 | 81060049 | Movie | Leslie Jones: Time Machine | David Benioff, D.B. Weiss | Leslie Jones | United States | January 14, 2020 | 2020 | TV-MA | 66 min | Stand-Up Comedy | From trying to seduce Prince to battling sleep... |
| 3436 | 80239306 | TV Show | The Healing Powers of Dude | NaN | Jace Chapman, Larisa Oleynik, Tom Everett Scot... | NaN | January 13, 2020 | 2020 | TV-G | 1 Season | Kids' TV, TV Comedies, TV Dramas | When an 11-year-old boy with social anxiety di... |
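
One thing worth noting about the country filter: comparing with == only matches rows whose country is exactly 'India', while many titles list several countries in one string. Below is a small sketch, not part of the original notebook, that contrasts the exact match with a substring match via str.contains. The three-row sample frame is built by hand from the title and country values shown in the outputs above, so it runs without the CSV file.

```python
import pandas as pd

# Hand-built sample using title/country values from the outputs above
sample = pd.DataFrame({
    "title": [
        "Norm of the North: King Sized Adventure",
        "Article 15",
        "The Healing Powers of Dude",
    ],
    "country": [
        "United States, India, South Korea, China",
        "India",
        None,  # shown as NaN in the data frame
    ],
})

# Exact match: only rows where the whole string equals 'India'
exact = sample[sample["country"] == "India"]

# Substring match: rows where 'India' appears anywhere in the string;
# na=False treats missing countries as "no match" instead of NaN
partial = sample[sample["country"].str.contains("India", na=False)]

print(exact["title"].tolist())    # ['Article 15']
print(partial["title"].tolist())  # the first two titles
```

Which of the two is appropriate depends on the question being asked: titles produced only in India, or titles that India had any part in.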

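This is a great Python data science cheat sheet that lists the commonly used pandas methods and attributes, along with other libraries used for data science.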

2. Cleaning the data

The next step is to clean the data and remove any information that is not needed for the analysis.
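
The info() output earlier shows that columns such as director, cast, country, date_added, and rating contain missing values, so a typical cleaning pass starts by counting and handling them. The following is a minimal sketch under assumptions: the titles and director names are placeholders, and whether to drop or fill a missing value (and the 'Unknown' placeholder itself) is an illustrative choice, not what the original notebook does.

```python
import pandas as pd

# Hypothetical rows; None marks a missing value, as NaN does in the real data
sample = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "director": [None, "Director X", None],
    "rating": ["TV-Y7", "R", None],
})

# Count missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Drop rows that have no rating, since a rating filter cannot use them
cleaned = sample.dropna(subset=["rating"]).copy()

# Keep rows without a director, but make the gap explicit
cleaned["director"] = cleaned["director"].fillna("Unknown")
print(cleaned)
```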

Let's consider a sample use case in which we want to find Netflix comedy movies and shows that are suitable for all ages (rated TV-G).

# Let's select the relevant columns for analysis
df_shows = pd.DataFrame(data_frame, columns=['title','rating', 'listed_in'])
# filter comedy shows
df_comedy_shows = df_shows[df_shows['listed_in'].str.contains('Comed')]
df_comedy_shows.head()
| | title | rating | listed_in |
|---|---|---|---|
| 0 | Norm of the North: King Sized Adventure | TV-PG | Children & Family Movies, Comedies |
| 1 | Jandino: Whatever it Takes | TV-MA | Stand-Up Comedy |
| 4 | #realityhigh | TV-14 | Comedies |
| 7 | Fabrizio Copano: Solo pienso en mi | TV-MA | Stand-Up Comedy |
| 10 | Joaquín Reyes: Una y no más | TV-MA | Stand-Up Comedy |
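
The filter above covers the comedy half of the use case. To also restrict the result to the TV-G rating, one possible approach is to combine two boolean masks with &. This is a sketch on a hand-made frame with placeholder titles and the same three columns, not the notebook's own code.

```python
import pandas as pd

# Hypothetical rows with the same columns as df_shows
shows = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C", "Title D"],
    "rating": ["TV-G", "TV-MA", "TV-G", None],
    "listed_in": [
        "Kids' TV, TV Comedies, TV Dramas",
        "Stand-Up Comedy",
        "Docuseries",
        "Comedies",
    ],
})

# 'Comed' matches both 'Comedy' and 'Comedies'
is_comedy = shows["listed_in"].str.contains("Comed", na=False)

# Rows with a missing rating compare as False here
is_tv_g = shows["rating"] == "TV-G"

# Keep only rows where both conditions hold
tv_g_comedies = shows[is_comedy & is_tv_g]
print(tv_g_comedies["title"].tolist())  # ['Title A']
```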

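The GitHub repository for this notebook can be found here.

Resources

That's all for today. Tomorrow we will continue exploring the remaining steps of machine learning and data science, analyzing the data visually by building graphs and charts and creating a machine learning model.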