Knjižnica Pandas

Spodaj je pregled najosnovnejših metod, ki jih ponuja knjižnica Pandas. Vsaka od naštetih metod ponuja še cel kup dodatnih možnosti, ki so natančno opisane v uradni dokumentaciji. Z branjem dokumentacije se vam seveda najbolj splača začeti pri uvodih.

Predpriprava

# naložimo paket
import pandas as pd

# naložimo razpredelnico, s katero bomo delali
filmi = pd.read_csv('../02-zajem-podatkov/predavanja/obdelani-podatki/filmi.csv', index_col='id')

# ker bomo delali z velikimi razpredelnicami, povemo, da naj se vedno izpiše le 20 vrstic
pd.options.display.max_rows = 20

Osnovni izbori elementov razpredelnic

Z metodo .head(n=5) pogledamo prvih n, z metodo .tail(n=5) pa zadnjih n vrstic razpredelnice.

filmi.head(10)
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
4972 The Birth of a Nation 195 1915 6.3 NaN 22363 10000000.0 NaN The Stoneman family finds its friendship with ...
6864 Intolerance: Love's Struggle Throughout the Ages 163 1916 7.7 99.0 13970 2180000.0 NaN The story of a poor young woman, separated by ...
9968 Broken Blossoms or The Yellow Man and the Girl 90 1919 7.3 NaN 9296 NaN NaN A frail waif, abused by her brutal boxer fathe...
10323 Das Cabinet des Dr. Caligari 76 1920 8.1 NaN 56089 NaN NaN Hypnotist Dr. Caligari uses a somnambulist, Ce...
12349 The Kid 68 1921 8.3 NaN 110278 5450000.0 NaN The Tramp cares for an abandoned child, but ev...
12364 Körkarlen 107 1921 8.1 NaN 10158 NaN NaN On New Year's Eve, the driver of a ghostly car...
13442 Nosferatu, simfonija groze 94 1922 7.9 NaN 87208 NaN NaN Vampire Count Orlok expresses interest in a ne...
14341 Naše gostoljubje 65 1923 7.8 NaN 9980 NaN NaN A man returns to his Appalachian homestead. On...
14429 Resitev v zadnjem trenutku 74 1923 8.1 NaN 17921 623809.0 NaN A boy leaves his small country town and heads ...
15064 Der letzte Mann 77 1924 8.1 NaN 12345 94812.0 NaN An aging doorman is forced to face the scorn o...
filmi.tail()
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
11390036 A Fall from Grace 115 2020 5.8 34.0 10414 NaN NaN Disheartened since her ex-husband's affair, Gr...
11905962 Sputnik 113 2020 6.3 61.0 8285 NaN NaN The lone survivor of an enigmatic spaceship in...
12393526 Bulbbul 94 2020 6.6 NaN 8381 NaN NaN A man returns home after years to find his bro...
12567088 Raat Akeli Hai 149 2020 7.3 NaN 12232 NaN NaN The film follows a small town cop who is summo...
12749596 Host 57 2020 6.6 72.0 8713 NaN NaN Six friends hire a medium to hold a seance via...

Z rezinami pa dostopamo do izbranih vrstic.

filmi[3:10:2]
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
10323 Das Cabinet des Dr. Caligari 76 1920 8.1 NaN 56089 NaN NaN Hypnotist Dr. Caligari uses a somnambulist, Ce...
12364 Körkarlen 107 1921 8.1 NaN 10158 NaN NaN On New Year's Eve, the driver of a ghostly car...
14341 Naše gostoljubje 65 1923 7.8 NaN 9980 NaN NaN A man returns to his Appalachian homestead. On...
15064 Der letzte Mann 77 1924 8.1 NaN 12345 94812.0 NaN An aging doorman is forced to face the scorn o...

Z indeksiranjem razpredelnice dostopamo do posameznih stolpcev.

filmi['ocena']
id
4972        6.3
6864        7.7
9968        7.3
10323       8.1
12349       8.3
           ... 
11390036    5.8
11905962    6.3
12393526    6.6
12567088    7.3
12749596    6.6
Name: ocena, Length: 10000, dtype: float64

Do stolpcev pogosto dostopamo, zato lahko uporabimo tudi krajši zapis.

filmi.ocena
id
4972        6.3
6864        7.7
9968        7.3
10323       8.1
12349       8.3
           ... 
11390036    5.8
11905962    6.3
12393526    6.6
12567088    7.3
12749596    6.6
Name: ocena, Length: 10000, dtype: float64

Če želimo več stolpcev, moramo za indeks podati seznam vseh oznak.

filmi[['naslov', 'ocena']]
naslov ocena
id
4972 The Birth of a Nation 6.3
6864 Intolerance: Love's Struggle Throughout the Ages 7.7
9968 Broken Blossoms or The Yellow Man and the Girl 7.3
10323 Das Cabinet des Dr. Caligari 8.1
12349 The Kid 8.3
... ... ...
11390036 A Fall from Grace 5.8
11905962 Sputnik 6.3
12393526 Bulbbul 6.6
12567088 Raat Akeli Hai 7.3
12749596 Host 6.6

10000 rows × 2 columns

Do vrednosti z indeksom i dostopamo z .iloc[i], do tiste s ključem k pa z .loc[k].

filmi.iloc[120]
naslov                                   Foreign Correspondent
dolzina                                                    120
leto                                                      1940
ocena                                                      7.5
metascore                                                   88
glasovi                                                  18050
zasluzek                                           3.48459e+06
oznaka                                                     NaN
opis         On the eve of World War II, a young American r...
Name: 32484, dtype: object
filmi.loc[97576]
naslov                 Indiana Jones in Zadnji križarski pohod
dolzina                                                    127
leto                                                      1989
ocena                                                      8.2
metascore                                                   65
glasovi                                                 680400
zasluzek                                           1.97172e+08
oznaka                                                   PG-13
opis         In 1938, after his father Professor Henry Jone...
Name: 97576, dtype: object

Filtriranje

Izbor določenih vrstic razpredelnice naredimo tako, da za indeks podamo stolpec logičnih vrednosti, ki ga dobimo z običajnimi operacijami. V vrnjeni razpredelnici bodo ostale vrstice, pri katerih je v stolpcu vrednost True.

filmi.ocena >= 8
id
4972        False
6864        False
9968        False
10323        True
12349        True
            ...  
11390036    False
11905962    False
12393526    False
12567088    False
12749596    False
Name: ocena, Length: 10000, dtype: bool
filmi[filmi.ocena >= 9.3]
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
111161 Kaznilnica odrešitve 142 1994 9.3 80.0 2293163 28341469.0 R Two imprisoned men bond over a number of years...
252487 Hababam Sinifi 87 1975 9.3 NaN 36468 NaN NaN Lazy, uneducated students share a very close b...
filmi[(filmi.leto > 2010) & (filmi.ocena > 8) | (filmi.ocena < 5)]
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
28346 Tell Your Children 66 1936 3.8 70.0 7775 NaN NaN Cautionary tale features a fictionalized take ...
45826 Glen or Glenda 65 1953 4.2 NaN 7901 NaN NaN A psychiatrist tells the stories of a transves...
52077 Plan 9 from Outer Space 79 1957 4.0 NaN 35646 NaN NaN Evil aliens attack Earth and set their terribl...
54673 The Beast of Yucca Flats 54 1961 2.2 NaN 7952 NaN NaN A defecting Soviet scientist is hit by a nucle...
58548 Santa Claus Conquers the Martians 81 1964 2.7 NaN 10706 NaN NaN The Martians kidnap Santa Claus because there ...
... ... ... ... ... ... ... ... ... ...
10230426 Mrs. Serial Killer 106 2020 4.9 NaN 11734 NaN NaN When a doctor gets jailed for a string of shoc...
10324144 Article 15 130 2019 8.2 NaN 21888 NaN NaN In the rural heartlands of India, an upright p...
10431500 Yedinci Kogustaki Mucize 132 2019 8.3 NaN 30002 NaN NaN A story of love between a mentally-ill father ...
10481868 Black Christmas 92 2019 3.2 49.0 8890 NaN PG-13 A group of female students are stalked by a st...
10886166 365 dni 114 2020 3.3 NaN 39803 NaN NaN Massimo is a member of the Sicilian Mafia fami...

662 rows × 9 columns

Naloga

Poiščite filme, ki si jih želimo izogniti za vsako ceno, torej tiste, ki so daljši od dveh ur in imajo oceno pod 4.

filmi[(filmi.dolzina > 120) & (filmi.ocena < 4) & (filmi.glasovi > 50000)]
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
118688 Batman in Robin 125 1997 3.8 28.0 231499 107325195.0 PG-13 Batman and Robin try to keep their relationshi...
120179 Hitrost 2 - Brez zavor 121 1997 3.9 23.0 73855 48608066.0 PG-13 A computer hacker breaks into the computer sys...
2574698 Gunday 152 2014 2.4 NaN 57521 NaN NaN The lives of Calcutta's most powerful Gunday, ...
7886848 Sadak 2 133 2020 1.1 NaN 57957 NaN NaN The film picks up where Sadak left off, revolv...

Urejanje

Razpredelnico urejamo z metodo .sort_values, ki ji podamo ime ali seznam imen stolpcev, po katerih želimo urejati. Po želji lahko tudi povemo, kateri stolpci naj bodo urejeni naraščajoče in kateri padajoče.

filmi.sort_values('dolzina')
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
15324 Sherlock Jr. 45 1924 8.2 NaN 40303 977375.0 NaN A film projectionist longs to be a detective, ...
2061702 Hotarubi no mori e 45 2011 7.9 NaN 13082 NaN NaN Hotaru is rescued by a spirit when she gets lo...
2591814 Koto no ha no niwa 46 2013 7.5 NaN 33162 NaN NaN A 15-year-old boy and 27-year-old woman find a...
275230 Blood: The Last Vampire 48 2000 6.7 44.0 11847 NaN NaN Saya is a Japanese vampire slayer whose next m...
142236 Dragon Ball Z: Fukkatsu no Fusion!! Gokuu to V... 51 1995 7.7 NaN 9262 NaN PG The universe is thrown into dimensional chaos ...
... ... ... ... ... ... ... ... ... ...
425637 Chi bi 288 2008 7.4 73.0 43267 627047.0 R The first chapter of a two-part story centered...
74084 Novecento 317 1976 7.7 70.0 22091 NaN NaN The epic tale of a class struggle in twentieth...
1954470 Gangs of Wasseypur 321 2012 8.2 89.0 79426 NaN NaN A clash between Sultan and Shahid Khan leads t...
346336 La meglio gioventù 366 2003 8.5 89.0 20280 254224.0 R An Italian epic that follows the lives of two ...
111341 Sátántangó 439 1994 8.4 NaN 9144 NaN NaN On the eve of a large payment, residents of a ...

10000 rows × 9 columns

# najprej uredi padajoče po oceni, pri vsaki oceni pa še naraščajoče po letu
filmi.sort_values(['ocena', 'leto'], ascending=[False, True])
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
252487 Hababam Sinifi 87 1975 9.3 NaN 36468 NaN NaN Lazy, uneducated students share a very close b...
111161 Kaznilnica odrešitve 142 1994 9.3 80.0 2293163 28341469.0 R Two imprisoned men bond over a number of years...
68646 Boter 175 1972 9.2 100.0 1582906 134966411.0 NaN The aging patriarch of an organized crime dyna...
5354160 Aynabaji 147 2016 9.1 NaN 21429 NaN NaN Ayna is an actor and the prison is his stage. ...
71562 Boter, II. del 202 1974 9.0 90.0 1105879 57300000.0 NaN The early life and career of Vito Corleone in ...
... ... ... ... ... ... ... ... ... ...
4009460 Saving Christmas 79 2014 1.4 18.0 14855 2783970.0 PG His annual Christmas party faltering thanks to...
6038600 Smolensk 120 2016 1.4 NaN 7630 NaN NaN Inspired by true events of 2010 Polish Air For...
5988370 Reis 108 2017 1.4 NaN 72207 NaN NaN A drama about the early life of Recep Tayyip E...
7886848 Sadak 2 133 2020 1.1 NaN 57957 NaN NaN The film picks up where Sadak left off, revolv...
7221896 Cumali Ceber: Allah Seni Alsin 100 2017 1.0 NaN 37659 NaN NaN Cumali Ceber goes to a vacation with his child...

10000 rows × 9 columns

Združevanje

Z metodo .groupby ustvarimo razpredelnico posebne vrste, v katerem so vrstice združene glede na skupno lastnost.

filmi_po_letih = filmi.groupby('leto')
filmi_po_letih
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fddf96d8520>
# povprečna ocena vsakega leta
filmi_po_letih.ocena.mean()
leto
1915    6.300000
1916    7.700000
1919    7.300000
1920    8.100000
1921    8.200000
          ...   
2016    6.513889
2017    6.401127
2018    6.458333
2019    6.525000
2020    5.950526
Name: ocena, Length: 104, dtype: float64

Če želimo, lahko združujemo tudi po izračunanih lastnostih. Izračunajmo stolpec in ga shranimo v razpredelnico.

filmi['desetletje'] = 10 * (filmi.leto // 10)
filmi
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis desetletje
id
4972 The Birth of a Nation 195 1915 6.3 NaN 22363 10000000.0 NaN The Stoneman family finds its friendship with ... 1910
6864 Intolerance: Love's Struggle Throughout the Ages 163 1916 7.7 99.0 13970 2180000.0 NaN The story of a poor young woman, separated by ... 1910
9968 Broken Blossoms or The Yellow Man and the Girl 90 1919 7.3 NaN 9296 NaN NaN A frail waif, abused by her brutal boxer fathe... 1910
10323 Das Cabinet des Dr. Caligari 76 1920 8.1 NaN 56089 NaN NaN Hypnotist Dr. Caligari uses a somnambulist, Ce... 1920
12349 The Kid 68 1921 8.3 NaN 110278 5450000.0 NaN The Tramp cares for an abandoned child, but ev... 1920
... ... ... ... ... ... ... ... ... ... ...
11390036 A Fall from Grace 115 2020 5.8 34.0 10414 NaN NaN Disheartened since her ex-husband's affair, Gr... 2020
11905962 Sputnik 113 2020 6.3 61.0 8285 NaN NaN The lone survivor of an enigmatic spaceship in... 2020
12393526 Bulbbul 94 2020 6.6 NaN 8381 NaN NaN A man returns home after years to find his bro... 2020
12567088 Raat Akeli Hai 149 2020 7.3 NaN 12232 NaN NaN The film follows a small town cop who is summo... 2020
12749596 Host 57 2020 6.6 72.0 8713 NaN NaN Six friends hire a medium to hold a seance via... 2020

10000 rows × 10 columns

filmi_po_desetletjih = filmi.groupby('desetletje')

Preštejemo, koliko filmov je bilo v vsakem desetletju. Pri večini stolpcev dobimo iste številke, ker imamo v vsakem stolpcu enako vnosov. Če kje kakšen podatek manjkal, je številka manjša.

filmi_po_desetletjih.count()
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
desetletje
1910 3 3 3 3 1 3 2 0 3
1920 27 27 27 27 3 27 17 1 27
1930 88 88 88 88 36 88 31 5 88
1940 149 149 149 149 51 149 50 9 149
1950 224 224 224 224 91 224 90 20 224
1960 319 319 319 319 161 319 154 52 319
1970 444 444 444 444 316 444 271 48 444
1980 875 875 875 875 722 875 748 62 875
1990 1536 1536 1536 1536 1041 1536 1420 1231 1536
2000 2777 2777 2777 2777 2296 2777 2322 2325 2777
2010 3463 3463 3463 3463 2797 3463 2412 2533 3463
2020 95 95 95 95 76 95 8 58 95

Če želimo dobiti le število članov posamezne skupine, uporabimo metodo .size(). V tem primeru dobimo le stolpec, ne razpredelnice.

filmi_po_desetletjih.size()
desetletje
1910       3
1920      27
1930      88
1940     149
1950     224
1960     319
1970     444
1980     875
1990    1536
2000    2777
2010    3463
2020      95
dtype: int64

Pogledamo povprečja vsakega desetletja. Dobimo povprečno leto, dolžino, ocene in zaslužek. Povprečnega naslova ne dobimo, ker se ga ne da izračunati, zato ustreznega stolpca ni.

filmi_po_desetletjih.mean()
dolzina leto ocena metascore glasovi zasluzek
desetletje
1910 149.333333 1916.666667 7.100000 99.000000 15209.666667 6.090000e+06
1920 88.555556 1925.296296 7.892593 84.666667 36366.851852 2.077247e+06
1930 97.056818 1935.272727 7.610227 87.083333 35382.784091 1.466444e+07
1940 105.677852 1944.704698 7.644966 86.039216 36569.973154 1.165745e+07
1950 108.665179 1954.897321 7.540179 82.318681 43097.950893 1.549802e+07
1960 118.282132 1964.620690 7.440752 75.546584 42448.840125 2.082776e+07
1970 112.191441 1974.763514 7.234459 67.515823 55681.234234 3.059132e+07
1980 106.811429 1985.125714 6.732114 57.404432 57676.565714 3.320314e+07
1990 109.961589 1995.000651 6.581510 56.834774 78348.498698 3.588821e+07
2000 108.387829 2004.993878 6.456068 53.987369 88156.186892 3.791655e+07
2010 109.803350 2014.449321 6.454981 56.086164 79023.464049 4.348670e+07
2020 108.778947 2020.000000 5.950526 53.447368 31318.936842 9.320018e+07

Naloga

Izračunajte število filmov posamezne dolžine, zaokrožene na 5 minut.

Risanje grafov

Običajen graf dobimo z metodo plot. Uporabljamo ga, kadar želimo prikazati spreminjanje vrednosti v odvisnosti od zvezne spremenljivke. Naša hipoteza je, da so zlata leta filma mimo. Graf to zanika.

filmi[filmi.ocena > 9].groupby('desetletje').size().plot()
<AxesSubplot:xlabel='desetletje'>
_images/03-knjiznica-pandas_40_1.png

Razsevni diagram dobimo z metodo plot.scatter. Uporabljamo ga, če želimo ugotoviti povezavo med dvema spremenljivkama.

filmi.plot.scatter('ocena', 'metascore')
<AxesSubplot:xlabel='ocena', ylabel='metascore'>
_images/03-knjiznica-pandas_42_1.png
filmi[filmi.dolzina < 250].plot.scatter('dolzina', 'ocena')
<AxesSubplot:xlabel='dolzina', ylabel='ocena'>
_images/03-knjiznica-pandas_43_1.png

Stolpčni diagram dobimo z metodo plot.bar. Uporabljamo ga, če želimo primerjati vrednosti pri diskretnih (običajno kategoričnih) spremenljivkah. Pogosto je koristno, da graf uredimo po vrednostih.

filmi.sort_values('zasluzek', ascending=False).head(20).plot.bar(x='naslov', y='zasluzek')
<AxesSubplot:xlabel='naslov'>
_images/03-knjiznica-pandas_45_1.png

Naloga

Narišite grafe, ki ustrezno kažejo:

  • Povezavo med IMDB in metascore oceno

  • Spreminjanje povprečne dolžine filmov skozi leta

Stikanje

osebe = pd.read_csv('../02-zajem-podatkov/predavanja/obdelani-podatki/osebe.csv', index_col='id')
vloge = pd.read_csv('../02-zajem-podatkov/predavanja/obdelani-podatki/vloge.csv')
zanri = pd.read_csv('../02-zajem-podatkov/predavanja/obdelani-podatki/zanri.csv')

Razpredelnice stikamo s funkcijo merge, ki vrne razpredelnico vnosov iz obeh tabel, pri katerih se vsi istoimenski podatki ujemajo.

vloge[vloge.film == 12349]
film oseba vloga mesto
20 12349 122 I 1
21 12349 701012 I 2
22 12349 1067 I 3
23 12349 588033 I 4
24 12349 122 R 1
zanri[zanri.film == 12349]
film zanr
10 12349 Comedy
11 12349 Drama
12 12349 Family
pd.merge(vloge, zanri).head(20)
film oseba vloga mesto zanr
0 4972 1273 I 1 Drama
1 4972 1273 I 1 History
2 4972 1273 I 1 War
3 4972 550615 I 2 Drama
4 4972 550615 I 2 History
5 4972 550615 I 2 War
6 4972 910400 I 3 Drama
7 4972 910400 I 3 History
8 4972 910400 I 3 War
9 4972 178270 I 4 Drama
10 4972 178270 I 4 History
11 4972 178270 I 4 War
12 4972 428 R 1 Drama
13 4972 428 R 1 History
14 4972 428 R 1 War
15 6864 1273 I 1 Drama
16 6864 1273 I 1 History
17 6864 366008 I 2 Drama
18 6864 366008 I 2 History
19 6864 550615 I 3 Drama

V osnovi vsebuje staknjena razpredelnica le tiste vnose, ki se pojavijo v obeh tabelah. Temu principu pravimo notranji stik (inner join). Lahko pa se odločimo, da izberemo tudi tiste vnose, ki imajo podatke le v levi tabeli (left join), le v desni tabeli (right join) ali v vsaj eni tabeli (outer join). Če v eni tabeli ni vnosov, bodo v staknjeni tabeli označene manjkajoče vrednosti. Ker smo v našem primeru podatke jemali iz IMDBja, kjer so za vsak film določeni tako žanri kot vloge, do razlik ne pride.

Včasih želimo stikati tudi po stolpcih z različnimi imeni. V tem primeru funkciji merge podamo argumenta left_on in right_on.

pd.merge(pd.merge(vloge, zanri), osebe, left_on='oseba', right_on='id')
film oseba vloga mesto zanr ime
0 4972 1273 I 1 Drama Lillian Gish
1 4972 1273 I 1 History Lillian Gish
2 4972 1273 I 1 War Lillian Gish
3 6864 1273 I 1 Drama Lillian Gish
4 6864 1273 I 1 History Lillian Gish
... ... ... ... ... ... ...
130132 12749596 8772807 I 3 Mystery Emma Louise Webb
130133 12749596 7525711 I 4 Horror Radina Drandova
130134 12749596 7525711 I 4 Mystery Radina Drandova
130135 12749596 3699074 R 1 Horror Rob Savage
130136 12749596 3699074 R 1 Mystery Rob Savage

130137 rows × 6 columns

Poglejmo, katera osebe so nastopale v največ komedijah.

zanri_oseb = pd.merge(pd.merge(vloge, zanri), osebe, left_on='oseba', right_on='id')
zanri_oseb[
    (zanri_oseb.zanr == 'Comedy') &
    (zanri_oseb.vloga == 'I')
].groupby(
    'ime'
).size(
).sort_values(
    ascending=False
).head(20)
ime
Adam Sandler        39
Eddie Murphy        36
Jackie Chan         34
Owen Wilson         32
Will Ferrell        32
Ben Stiller         32
Steve Martin        31
Robin Williams      31
Woody Allen         29
Bill Murray         28
Tom Hanks           28
Steve Carell        26
Cameron Diaz        26
Jennifer Aniston    26
Julia Roberts       25
Akshay Kumar        25
Jack Black          24
Paul Rudd           24
Danny DeVito        24
Bruce Willis        24
dtype: int64

Naloga

  • Izračunajte povprečno oceno vsakega žanra.

  • Kateri režiserji snemajo najdonosnejše filme?