The Pandas library
Below is an overview of the most basic methods offered by the Pandas library. Each of them has many additional options, which are described in detail in the official documentation. The best place to start reading the documentation is, of course, its introductory guides.
Setup
# load the package
import pandas as pd
# load the table we will be working with
filmi = pd.read_csv('podatki/filmi.csv', index_col='id')
# the tables are large, so ask pandas to print at most 20 rows
pd.options.display.max_rows = 20
Basic selection of table elements
The .head(n=5) method shows the first n rows of a table, and the .tail(n=5) method shows the last n rows.
filmi.head(10)
naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
4972 | The Birth of a Nation | 195 | 1915 | 6.2 | NaN | 24890 | 10000000.0 | TV-PG | The Stoneman family finds its friendship with ... |
6864 | Intolerance | 197 | 1916 | 7.7 | 99.0 | 15670 | 2180000.0 | Passed | The story of a poor young woman separated by p... |
9968 | Broken Blossoms | 90 | 1919 | 7.3 | NaN | 10423 | NaN | Not Rated | A frail waif, abused by her brutal boxer fathe... |
10323 | The Cabinet of Dr. Caligari | 67 | 1920 | 8.0 | NaN | 64133 | NaN | Not Rated | Hypnotist Dr. Caligari uses a somnambulist, Ce... |
12349 | The Kid | 68 | 1921 | 8.3 | NaN | 126513 | 5450000.0 | Passed | The Tramp cares for an abandoned child, but ev... |
12364 | The Phantom Carriage | 107 | 1921 | 8.0 | NaN | 12624 | NaN | Not Rated | On New Year's Eve, the driver of a ghostly car... |
13442 | Nosferatu | 94 | 1922 | 7.9 | NaN | 97589 | NaN | Not Rated | Vampire Count Orlok expresses interest in a ne... |
14341 | Our Hospitality | 65 | 1923 | 7.8 | NaN | 11428 | 1172499.0 | Passed | A man returns to his Appalachian homestead. On... |
14429 | Safety Last! | 74 | 1923 | 8.1 | NaN | 20887 | 1359903.0 | Not Rated | A boy leaves his small country town and heads ... |
15064 | The Last Laugh | 90 | 1924 | 8.0 | NaN | 14150 | 94812.0 | Not Rated | An aging doorman is forced to face the scorn o... |
filmi.tail()
naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
18568902 | Kaun Pravin Tambe? | 134 | 2022 | 8.4 | NaN | 10163 | NaN | NaN | An indian cricketer who shows persistence and ... |
18689424 | Batman v Superman: Dawn of Justice - Ultimate ... | 182 | 2016 | 7.1 | NaN | 57662 | NaN | R | Batman is manipulated by Lex Luthor to fear Su... |
18968540 | Incantation | 110 | 2022 | 6.2 | NaN | 12366 | NaN | TV-MA | Six years ago, Li Ronan was cursed after break... |
20850406 | Sita Ramam | 163 | 2022 | 8.5 | NaN | 38490 | NaN | NaN | An orphan soldier, Lieutenant Ram's life chang... |
21279138 | Maid in Malacañang | 114 | 2022 | 3.9 | NaN | 15273 | NaN | NaN | The Last Days of Ferdinand and Imelda Marcos t... |
Slices select rows by position. The example below takes every second row from position 3 up to, but not including, position 10.
filmi[3:10:2]
naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
10323 | The Cabinet of Dr. Caligari | 67 | 1920 | 8.0 | NaN | 64133 | NaN | Not Rated | Hypnotist Dr. Caligari uses a somnambulist, Ce... |
12364 | The Phantom Carriage | 107 | 1921 | 8.0 | NaN | 12624 | NaN | Not Rated | On New Year's Eve, the driver of a ghostly car... |
14341 | Our Hospitality | 65 | 1923 | 7.8 | NaN | 11428 | 1172499.0 | Passed | A man returns to his Appalachian homestead. On... |
15064 | The Last Laugh | 90 | 1924 | 8.0 | NaN | 14150 | 94812.0 | Not Rated | An aging doorman is forced to face the scorn o... |
Indexing the table with a column label gives access to an individual column.
filmi['ocena']
id
4972 6.2
6864 7.7
9968 7.3
10323 8.0
12349 8.3
...
18568902 8.4
18689424 7.1
18968540 6.2
20850406 8.5
21279138 3.9
Name: ocena, Length: 9999, dtype: float64
Columns are accessed often, so a shorter attribute notation is also available.
filmi.ocena
id
4972 6.2
6864 7.7
9968 7.3
10323 8.0
12349 8.3
...
18568902 8.4
18689424 7.1
18968540 6.2
20850406 8.5
21279138 3.9
Name: ocena, Length: 9999, dtype: float64
To select several columns, pass a list of all their labels as the index.
filmi[['naslov', 'ocena']]
naslov | ocena | |
---|---|---|
id | ||
4972 | The Birth of a Nation | 6.2 |
6864 | Intolerance | 7.7 |
9968 | Broken Blossoms | 7.3 |
10323 | The Cabinet of Dr. Caligari | 8.0 |
12349 | The Kid | 8.3 |
... | ... | ... |
18568902 | Kaun Pravin Tambe? | 8.4 |
18689424 | Batman v Superman: Dawn of Justice - Ultimate ... | 7.1 |
18968540 | Incantation | 6.2 |
20850406 | Sita Ramam | 8.5 |
21279138 | Maid in Malacañang | 3.9 |
9999 rows × 2 columns
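The difference between indexing with a single label and with a list of labels can be checked on a small example. The sketch below uses a hypothetical three-row stand-in for the filmi table (the values are copied from the outputs above, but the table itself is an assumption made only for illustration).

```python
import pandas as pd

# Hypothetical miniature stand-in for the filmi table
filmi = pd.DataFrame(
    {"naslov": ["The Kid", "Nosferatu", "Rebecca"], "ocena": [8.3, 7.9, 8.1]},
    index=pd.Index([12349, 13442, 32976], name="id"),
)

one_column = filmi["ocena"]          # a single label gives a Series
same_column = filmi.ocena            # attribute notation gives the same Series
one_column_table = filmi[["ocena"]]  # a one-element list gives a DataFrame

print(type(one_column).__name__, type(one_column_table).__name__)
```

The attribute notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the bracket form is the safer default.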
The row at position i is accessed with .iloc[i], and the row with key k with .loc[k].
filmi.iloc[120]
naslov Rebecca
dolzina 130
leto 1940
ocena 8.1
metascore 86.0
glasovi 137358
zasluzek 4360000.0
oznaka Approved
opis A self-conscious woman juggles adjusting to he...
Name: 32976, dtype: object
filmi.loc[97576]
naslov Indiana Jones and the Last Crusade
dolzina 127
leto 1989
ocena 8.2
metascore 65.0
glasovi 750654
zasluzek 197171806.0
oznaka PG-13
opis In 1938, after his father Professor Henry Jone...
Name: 97576, dtype: object
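A minimal sketch of the difference between position-based and key-based access, again on a hypothetical three-row table rather than the real data:

```python
import pandas as pd

# Hypothetical miniature stand-in for the filmi table, indexed by id
filmi = pd.DataFrame(
    {"naslov": ["The Kid", "Nosferatu", "Rebecca"], "leto": [1921, 1922, 1940]},
    index=pd.Index([12349, 13442, 32976], name="id"),
)

by_position = filmi.iloc[1]   # second row, counting from 0
by_key = filmi.loc[13442]     # row whose id equals 13442

# .loc looks up keys, not positions, so a position that is not a key fails
try:
    filmi.loc[1]
    key_found = True
except KeyError:
    key_found = False

print(by_position["naslov"], by_key["naslov"], key_found)
```

Both lookups return the same row here, because the film with id 13442 happens to sit at position 1.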
Filtering
We select particular rows of a table by passing, as the index, a column of Boolean values obtained with ordinary comparison operations. The returned table keeps exactly the rows for which that column holds True.
filmi.ocena >= 8
id
4972 False
6864 False
9968 False
10323 True
12349 True
...
18568902 True
18689424 False
18968540 False
20850406 True
21279138 False
Name: ocena, Length: 9999, dtype: bool
filmi[filmi.ocena >= 9.3]
naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
111161 | The Shawshank Redemption | 142 | 1994 | 9.3 | 81.0 | 2651625 | 28341469.0 | R | Two imprisoned men bond over a number of years... |
15327088 | Kantara | 148 | 2022 | 9.4 | NaN | 33294 | NaN | NaN | It involves culture of Kambla and Bhootha Kola... |
filmi[(filmi.leto > 2010) & (filmi.ocena > 8) | (filmi.ocena < 5)]
naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
52077 | Plan 9 from Outer Space | 79 | 1957 | 3.9 | 56.0 | 38744 | NaN | Not Rated | Evil aliens attack Earth and set their terribl... |
54673 | The Beast of Yucca Flats | 54 | 1961 | 1.8 | NaN | 11242 | NaN | Unrated | A defecting Soviet scientist is hit by a nucle... |
58548 | Santa Claus Conquers the Martians | 81 | 1964 | 2.6 | NaN | 11838 | NaN | Not Rated | The Martians kidnap Santa Claus because there ... |
59464 | Monster a Go-Go | 68 | 1965 | 1.7 | NaN | 11138 | NaN | TV-PG | A space capsule crash-lands on Earth, and the ... |
60666 | Manos: The Hands of Fate | 70 | 1966 | 1.6 | NaN | 36445 | NaN | Not Rated | A family gets lost on the road and stumbles up... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
15654262 | Chup | 135 | 2022 | 8.4 | NaN | 13098 | NaN | NaN | A psychopath killer, targeting film critics. T... |
16492678 | Demon Slayer: Kimetsu no Yaiba - Tsuzumi Mansi... | 87 | 2021 | 9.0 | NaN | 12634 | NaN | NaN | Tanjiro ventures to the south-southeast where ... |
18568902 | Kaun Pravin Tambe? | 134 | 2022 | 8.4 | NaN | 10163 | NaN | NaN | An indian cricketer who shows persistence and ... |
20850406 | Sita Ramam | 163 | 2022 | 8.5 | NaN | 38490 | NaN | NaN | An orphan soldier, Lieutenant Ram's life chang... |
21279138 | Maid in Malacañang | 114 | 2022 | 3.9 | NaN | 15273 | NaN | NaN | The Last Days of Ferdinand and Imelda Marcos t... |
695 rows × 9 columns
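When conditions are combined, operator precedence matters: & binds more tightly than |, and both bind more tightly than comparisons such as > or <, which is why every comparison needs its own parentheses. The sketch below illustrates this on a hypothetical four-row table; the mask names are chosen only for this example.

```python
import pandas as pd

# Hypothetical toy data, not the real filmi table
filmi = pd.DataFrame(
    {
        "leto": [1957, 2015, 2021, 1994],
        "ocena": [3.9, 8.4, 6.0, 9.3],
    }
)

recent = filmi.leto > 2010
great = filmi.ocena > 8
awful = filmi.ocena < 5

# & is evaluated before |, so these two masks are identical
implicit = recent & great | awful
explicit = (recent & great) | awful

# Moving the parentheses asks a different question
other = recent & (great | awful)

print(filmi[explicit])
print(filmi[other])
```

Writing the grouping parentheses explicitly costs nothing and makes the intended condition visible to the reader.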
Exercise
Find the films we want to avoid at all costs, that is, those that are longer than two hours and have a rating below 4. The solution below additionally keeps only films with more than 50,000 votes.
filmi[(filmi.dolzina > 120) & (filmi.ocena < 4) & (filmi.glasovi > 50000)]
naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
118688 | Batman & Robin | 125 | 1997 | 3.7 | 28.0 | 253972 | 107325195.0 | PG-13 | Batman and Robin try to keep their relationshi... |
120179 | Speed 2: Cruise Control | 121 | 1997 | 3.9 | 23.0 | 81714 | 48608066.0 | PG-13 | A computer hacker breaks into the computer sys... |
2574698 | Gunday | 152 | 2014 | 2.6 | NaN | 59270 | NaN | Not Rated | The lives of Calcutta's most powerful Gunday, ... |
7886848 | Sadak 2 | 133 | 2020 | 1.1 | NaN | 95865 | NaN | TV-MA | The film picks up where Sadak left off, revolv... |
10350922 | Laxmii | 141 | 2020 | 2.6 | NaN | 57411 | NaN | TV-MA | Aasif visits his wife's parents' house and hap... |
10888594 | Radhe | 135 | 2021 | 1.9 | NaN | 177814 | NaN | TV-MA | After taking the dreaded gangster Gani Bhai, A... |
Sorting
A table is sorted with the .sort_values method, which takes the name, or a list of names, of the columns to sort by. Optionally, we can also state which columns should be sorted in ascending and which in descending order.
filmi.sort_values('dolzina')
naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
2061702 | To the Forest of Firefly Lights | 45 | 2011 | 7.8 | NaN | 18535 | NaN | NaN | Hotaru is rescued by a spirit when she gets lo... |
15324 | Sherlock Jr. | 45 | 1924 | 8.2 | NaN | 50180 | 977375.0 | Passed | A film projectionist longs to be a detective, ... |
2591814 | The Garden of Words | 46 | 2013 | 7.4 | NaN | 44624 | NaN | TV-14 | A 15-year-old boy and 27-year-old woman find a... |
275230 | Blood: The Last Vampire | 48 | 2000 | 6.6 | 44.0 | 12761 | NaN | Not Rated | Saya is a Japanese vampire slayer whose next m... |
142236 | Dragon Ball Z: Revival Fusion | 51 | 1995 | 7.6 | NaN | 11050 | NaN | PG | The universe is thrown into dimensional chaos ... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
107007 | Gettysburg | 271 | 1993 | 7.6 | NaN | 29479 | 10769960.0 | PG | In 1863, the Northern and Southern forces figh... |
74084 | 1900 | 317 | 1976 | 7.7 | 70.0 | 25679 | NaN | Unrated | The epic tale of a class struggle in twentieth... |
1954470 | Gangs of Wasseypur | 321 | 2012 | 8.2 | 89.0 | 96141 | NaN | Not Rated | A clash between Sultan and Shahid Khan leads t... |
346336 | The Best of Youth | 366 | 2003 | 8.5 | 89.0 | 22119 | 274024.0 | R | An Italian epic that follows the lives of two ... |
111341 | Satantango | 439 | 1994 | 8.3 | NaN | 11214 | NaN | Not Rated | On the eve of a large payment, residents of a ... |
9999 rows × 9 columns
# sort by rating in descending order first, then by year in ascending order within each rating
filmi.sort_values(['ocena', 'leto'], ascending=[False, True])
naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
15327088 | Kantara | 148 | 2022 | 9.4 | NaN | 33294 | NaN | NaN | It involves culture of Kambla and Bhootha Kola... |
111161 | The Shawshank Redemption | 142 | 1994 | 9.3 | 81.0 | 2651625 | 28341469.0 | R | Two imprisoned men bond over a number of years... |
68646 | The Godfather | 175 | 1972 | 9.2 | 100.0 | 1838099 | 134966411.0 | R | The aging patriarch of an organized crime dyna... |
252487 | The Chaos Class | 87 | 1975 | 9.2 | NaN | 40747 | NaN | NaN | Lazy, uneducated students share a very close b... |
50083 | 12 Angry Men | 96 | 1957 | 9.0 | 96.0 | 782923 | 4360000.0 | Approved | The jury in a New York City murder trial is fr... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
421051 | Daniel the Wizard | 81 | 2004 | 1.2 | NaN | 14413 | NaN | Not Rated | Evil assassins want to kill Daniel Kublbock, t... |
6038600 | Smolensk | 120 | 2016 | 1.2 | NaN | 39704 | NaN | NaN | An inspired story of people affected by the tr... |
7886848 | Sadak 2 | 133 | 2020 | 1.1 | NaN | 95865 | NaN | TV-MA | The film picks up where Sadak left off, revolv... |
5988370 | Reis | 108 | 2017 | 1.0 | NaN | 73382 | NaN | NaN | A drama about the early life of Recep Tayyip E... |
7221896 | Cumali Ceber | 100 | 2017 | 1.0 | NaN | 38958 | NaN | NaN | Cumali Ceber goes to a vacation with his child... |
9999 rows × 9 columns
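The effect of a per-column ascending list is easiest to see on a handful of rows. This is a small sketch on hypothetical data with placeholder titles:

```python
import pandas as pd

# Hypothetical toy data with placeholder titles
filmi = pd.DataFrame(
    {
        "naslov": ["A", "B", "C", "D"],
        "leto": [1975, 1972, 1994, 1957],
        "ocena": [9.2, 9.2, 9.3, 9.0],
    }
)

# Rating descending; ties on rating are broken by year ascending
ordered = filmi.sort_values(["ocena", "leto"], ascending=[False, True])

print(ordered)
```

Note that sort_values returns a new table and leaves the original row order untouched.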
Grouping
The .groupby method creates a special kind of table object (a DataFrameGroupBy) in which the rows are grouped by a shared property.
filmi_po_letih = filmi.groupby('leto')
filmi_po_letih
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb7d8423ee0>
# average rating for each year
filmi_po_letih.ocena.mean()
leto
1915 6.200000
1916 7.700000
1919 7.300000
1920 8.000000
1921 8.150000
...
2018 6.430748
2019 6.493051
2020 6.144304
2021 6.369742
2022 6.361628
Name: ocena, Length: 106, dtype: float64
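To see what the grouped mean computes, it can help to compare it with a manual calculation for a single group. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data: two films from 1921, one each from 1922 and 1940
filmi = pd.DataFrame(
    {
        "leto": [1921, 1921, 1922, 1940],
        "ocena": [8.3, 8.0, 7.9, 8.1],
    }
)

# One mean per distinct year, indexed by year
mean_by_year = filmi.groupby("leto").ocena.mean()

# The same number for one year, computed by filtering first
mean_1921 = filmi[filmi.leto == 1921].ocena.mean()

print(mean_by_year)
```

The grouped result is a Series whose index holds the group keys, which is why the years appear in the left column of the output.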
We can also group by computed properties. Let's compute a new column and store it in the table.
filmi['desetletje'] = 10 * (filmi.leto // 10)
filmi
naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis | desetletje | |
---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||
4972 | The Birth of a Nation | 195 | 1915 | 6.2 | NaN | 24890 | 10000000.0 | TV-PG | The Stoneman family finds its friendship with ... | 1910 |
6864 | Intolerance | 197 | 1916 | 7.7 | 99.0 | 15670 | 2180000.0 | Passed | The story of a poor young woman separated by p... | 1910 |
9968 | Broken Blossoms | 90 | 1919 | 7.3 | NaN | 10423 | NaN | Not Rated | A frail waif, abused by her brutal boxer fathe... | 1910 |
10323 | The Cabinet of Dr. Caligari | 67 | 1920 | 8.0 | NaN | 64133 | NaN | Not Rated | Hypnotist Dr. Caligari uses a somnambulist, Ce... | 1920 |
12349 | The Kid | 68 | 1921 | 8.3 | NaN | 126513 | 5450000.0 | Passed | The Tramp cares for an abandoned child, but ev... | 1920 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18568902 | Kaun Pravin Tambe? | 134 | 2022 | 8.4 | NaN | 10163 | NaN | NaN | An indian cricketer who shows persistence and ... | 2020 |
18689424 | Batman v Superman: Dawn of Justice - Ultimate ... | 182 | 2016 | 7.1 | NaN | 57662 | NaN | R | Batman is manipulated by Lex Luthor to fear Su... | 2010 |
18968540 | Incantation | 110 | 2022 | 6.2 | NaN | 12366 | NaN | TV-MA | Six years ago, Li Ronan was cursed after break... | 2020 |
20850406 | Sita Ramam | 163 | 2022 | 8.5 | NaN | 38490 | NaN | NaN | An orphan soldier, Lieutenant Ram's life chang... | 2020 |
21279138 | Maid in Malacañang | 114 | 2022 | 3.9 | NaN | 15273 | NaN | NaN | The Last Days of Ferdinand and Imelda Marcos t... | 2020 |
9999 rows × 10 columns
filmi_po_desetletjih = filmi.groupby('desetletje')
Let's count how many films there are in each decade. The .count() method counts the non-missing values in each column, so most columns show the same number. Where some values are missing, the number is smaller.
filmi_po_desetletjih.count()
naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis | |
---|---|---|---|---|---|---|---|---|---|
desetletje | |||||||||
1910 | 3 | 3 | 3 | 3 | 1 | 3 | 2 | 3 | 3 |
1920 | 27 | 27 | 27 | 27 | 4 | 27 | 18 | 27 | 27 |
1930 | 80 | 80 | 80 | 80 | 39 | 80 | 36 | 80 | 80 |
1940 | 134 | 134 | 134 | 134 | 63 | 134 | 46 | 133 | 134 |
1950 | 205 | 205 | 205 | 205 | 113 | 205 | 92 | 205 | 205 |
1960 | 284 | 284 | 284 | 284 | 172 | 284 | 150 | 281 | 284 |
1970 | 410 | 410 | 410 | 410 | 323 | 410 | 276 | 394 | 410 |
1980 | 823 | 823 | 823 | 823 | 721 | 823 | 711 | 809 | 823 |
1990 | 1420 | 1420 | 1420 | 1420 | 1128 | 1420 | 1324 | 1399 | 1420 |
2000 | 2575 | 2575 | 2575 | 2575 | 2183 | 2575 | 2203 | 2507 | 2575 |
2010 | 3358 | 3358 | 3358 | 3358 | 2728 | 3358 | 2353 | 3228 | 3358 |
2020 | 680 | 680 | 680 | 680 | 500 | 680 | 47 | 586 | 680 |
If we only want the number of members in each group, we use the .size() method. In that case we get just a single column, not a table.
filmi_po_desetletjih.size()
desetletje
1910 3
1920 27
1930 80
1940 134
1950 205
1960 284
1970 410
1980 823
1990 1420
2000 2575
2010 3358
2020 680
dtype: int64
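The difference between .count() and .size() shows up as soon as a column has missing values. A small sketch on hypothetical data with some missing metascore entries:

```python
import pandas as pd

nan = float("nan")

# Hypothetical toy data; metascore is missing for three of the five films
filmi = pd.DataFrame(
    {
        "desetletje": [1910, 1910, 1920, 1920, 1920],
        "ocena": [6.2, 7.7, 8.0, 8.3, 8.0],
        "metascore": [nan, 99.0, nan, nan, 70.0],
    }
)

groups = filmi.groupby("desetletje")
counts = groups.count()   # DataFrame: non-missing values per column
sizes = groups.size()     # Series: number of rows per group

print(counts)
print(sizes)
```

Here the ocena column of counts matches sizes, while the metascore column is smaller wherever values are missing.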
Let's look at the averages for each decade. We get the average year, length, rating, and earnings. An average title cannot be computed, so we pass numeric_only=True to leave the text columns out. Without it, pandas 2.0 and later raise a TypeError instead of silently dropping those columns.
filmi_po_desetletjih.mean(numeric_only=True)
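A minimal sketch of two ways to average only the numeric columns of a grouped table, shown on a hypothetical three-row table that contains one text column:

```python
import pandas as pd

# Hypothetical toy data with a text column (naslov)
filmi = pd.DataFrame(
    {
        "naslov": ["The Kid", "Nosferatu", "Rebecca"],
        "desetletje": [1920, 1920, 1940],
        "ocena": [8.3, 7.9, 8.1],
    }
)

# Option 1: ask mean() to skip non-numeric columns
means = filmi.groupby("desetletje").mean(numeric_only=True)

# Option 2: select the numeric columns explicitly before aggregating
selected = filmi.groupby("desetletje")[["ocena"]].mean()

print(means)
```

Selecting the columns explicitly is more verbose, but it documents exactly which averages are wanted.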
Exercise
Compute the number of films of each length, with lengths rounded to 5 minutes.
Plotting
An ordinary line plot is produced with the plot method. We use it to show how a value changes as a function of a continuous variable. Our hypothesis is that the golden years of film are over. The plot refutes it.
filmi[filmi.ocena > 9].groupby('desetletje').size().plot()
A scatter plot is produced with the plot.scatter method. We use it to look for a relationship between two variables.
filmi.plot.scatter('ocena', 'metascore')
filmi[filmi.dolzina < 250].plot.scatter('dolzina', 'ocena')
A bar chart is produced with the plot.bar method. We use it to compare values across discrete (usually categorical) variables. It often helps to sort the chart by value.
filmi.sort_values('zasluzek', ascending=False).head(20).plot.bar(x='naslov', y='zasluzek')
Exercise
Draw plots that suitably show:
The relationship between the IMDb rating and the Metascore
How the average film length has changed over the years
Joining
osebe = pd.read_csv('podatki/osebe.csv', index_col='id')
vloge = pd.read_csv('podatki/vloge.csv')
zanri = pd.read_csv('podatki/zanri.csv')
Tables are joined with the merge function, which returns a table of entries from both tables for which all identically named columns match.
vloge[vloge.film == 12349]
zanri[zanri.film == 12349]
pd.merge(vloge, zanri).head(20)
By default, the joined table contains only the entries that appear in both tables. This is called an inner join. We can instead choose, with the how argument of merge, to also keep entries that have data only in the left table (left join), only in the right table (right join), or in at least one of the tables (outer join). Where one table has no matching entry, the joined table holds missing values. Because our data comes from IMDb, where every film has both genres and roles, the join types give the same result here.
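Since the join types do not differ on the real data, the following sketch uses two hypothetical tables that only partly overlap (film 1 has no genre, film 3 has no role), so the difference becomes visible:

```python
import pandas as pd

# Hypothetical toy tables that share only film 2
vloge = pd.DataFrame({"film": [1, 2], "oseba": [10, 11]})
zanri = pd.DataFrame({"film": [2, 3], "zanr": ["Comedy", "Drama"]})

inner = pd.merge(vloge, zanri)               # default: only film 2
left = pd.merge(vloge, zanri, how="left")    # films 1 and 2
right = pd.merge(vloge, zanri, how="right")  # films 2 and 3
outer = pd.merge(vloge, zanri, how="outer")  # films 1, 2 and 3

print(outer)
```

In the left, right, and outer results, the cells with no counterpart in the other table are filled with NaN.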
Sometimes we want to join on columns with different names. In that case we pass the left_on and right_on arguments to merge.
pd.merge(pd.merge(vloge, zanri), osebe, left_on='oseba', right_on='id')
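A minimal sketch of joining on differently named columns. The tables and names are hypothetical, and, unlike the osebe table loaded above, the id here is an ordinary column rather than the index, which keeps the example simple:

```python
import pandas as pd

# Hypothetical toy tables; "oseba" in vloge refers to "id" in osebe
vloge = pd.DataFrame({"film": [12349, 12349], "oseba": [1, 2]})
osebe = pd.DataFrame({"id": [1, 2], "ime": ["Oseba A", "Oseba B"]})

# Match vloge.oseba against osebe.id
joined = pd.merge(vloge, osebe, left_on="oseba", right_on="id")

print(joined)
```

Because the two key columns have different names, both are kept in the result.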
Let's see which people have appeared in the most comedies.
zanri_oseb = pd.merge(pd.merge(vloge, zanri), osebe, left_on='oseba', right_on='id')
zanri_oseb[
(zanri_oseb.zanr == 'Comedy') &
(zanri_oseb.vloga == 'I')
].groupby(
'ime'
).size(
).sort_values(
ascending=False
).head(20)
Exercise
Compute the average rating of each genre.
Which directors make the most profitable films?