The Pandas library

Below is an overview of the most basic methods that the Pandas library offers. Each of them has many additional options, which are described in detail in the official documentation. When reading the documentation, the best place to start is the introductory guides.

Preparation

# load the package
import pandas as pd

# load the table we will be working with
filmi = pd.read_csv('podatki/filmi.csv', index_col='id')

# the tables are large, so display at most 20 rows; longer tables are shown truncated
pd.options.display.max_rows = 20

Basic selection of table elements

The .head(n=5) method shows the first n rows of a table, and the .tail(n=5) method shows the last n rows.

filmi.head(10)
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
4972 The Birth of a Nation 195 1915 6.2 NaN 24890 10000000.0 TV-PG The Stoneman family finds its friendship with ...
6864 Intolerance 197 1916 7.7 99.0 15670 2180000.0 Passed The story of a poor young woman separated by p...
9968 Broken Blossoms 90 1919 7.3 NaN 10423 NaN Not Rated A frail waif, abused by her brutal boxer fathe...
10323 The Cabinet of Dr. Caligari 67 1920 8.0 NaN 64133 NaN Not Rated Hypnotist Dr. Caligari uses a somnambulist, Ce...
12349 The Kid 68 1921 8.3 NaN 126513 5450000.0 Passed The Tramp cares for an abandoned child, but ev...
12364 The Phantom Carriage 107 1921 8.0 NaN 12624 NaN Not Rated On New Year's Eve, the driver of a ghostly car...
13442 Nosferatu 94 1922 7.9 NaN 97589 NaN Not Rated Vampire Count Orlok expresses interest in a ne...
14341 Our Hospitality 65 1923 7.8 NaN 11428 1172499.0 Passed A man returns to his Appalachian homestead. On...
14429 Safety Last! 74 1923 8.1 NaN 20887 1359903.0 Not Rated A boy leaves his small country town and heads ...
15064 The Last Laugh 90 1924 8.0 NaN 14150 94812.0 Not Rated An aging doorman is forced to face the scorn o...
filmi.tail()
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
18568902 Kaun Pravin Tambe? 134 2022 8.4 NaN 10163 NaN NaN An indian cricketer who shows persistence and ...
18689424 Batman v Superman: Dawn of Justice - Ultimate ... 182 2016 7.1 NaN 57662 NaN R Batman is manipulated by Lex Luthor to fear Su...
18968540 Incantation 110 2022 6.2 NaN 12366 NaN TV-MA Six years ago, Li Ronan was cursed after break...
20850406 Sita Ramam 163 2022 8.5 NaN 38490 NaN NaN An orphan soldier, Lieutenant Ram's life chang...
21279138 Maid in Malacañang 114 2022 3.9 NaN 15273 NaN NaN The Last Days of Ferdinand and Imelda Marcos t...

Slices select rows by position. The slice below takes the rows at positions 3, 5, 7, and 9.

filmi[3:10:2]
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
10323 The Cabinet of Dr. Caligari 67 1920 8.0 NaN 64133 NaN Not Rated Hypnotist Dr. Caligari uses a somnambulist, Ce...
12364 The Phantom Carriage 107 1921 8.0 NaN 12624 NaN Not Rated On New Year's Eve, the driver of a ghostly car...
14341 Our Hospitality 65 1923 7.8 NaN 11428 1172499.0 Passed A man returns to his Appalachian homestead. On...
15064 The Last Laugh 90 1924 8.0 NaN 14150 94812.0 Not Rated An aging doorman is forced to face the scorn o...

Indexing a table with a column label gives us an individual column.

filmi['ocena']
id
4972        6.2
6864        7.7
9968        7.3
10323       8.0
12349       8.3
           ... 
18568902    8.4
18689424    7.1
18968540    6.2
20850406    8.5
21279138    3.9
Name: ocena, Length: 9999, dtype: float64

We access columns often, so there is also a shorter notation.

filmi.ocena
id
4972        6.2
6864        7.7
9968        7.3
10323       8.0
12349       8.3
           ... 
18568902    8.4
18689424    7.1
18968540    6.2
20850406    8.5
21279138    3.9
Name: ocena, Length: 9999, dtype: float64

If we want several columns, we must pass a list of all their labels as the index.

filmi[['naslov', 'ocena']]
naslov ocena
id
4972 The Birth of a Nation 6.2
6864 Intolerance 7.7
9968 Broken Blossoms 7.3
10323 The Cabinet of Dr. Caligari 8.0
12349 The Kid 8.3
... ... ...
18568902 Kaun Pravin Tambe? 8.4
18689424 Batman v Superman: Dawn of Justice - Ultimate ... 7.1
18968540 Incantation 6.2
20850406 Sita Ramam 8.5
21279138 Maid in Malacañang 3.9

9999 rows × 2 columns
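The difference between the three ways of selecting columns can be seen on a small made-up table. This is only a sketch: the data and the name `df` are assumptions for illustration and are not taken from filmi.csv.

```python
import pandas as pd

# Hypothetical toy data, not taken from filmi.csv
df = pd.DataFrame(
    {
        "naslov": ["A", "B", "C"],
        "ocena": [7.5, 8.1, 6.0],
        "leto": [1999, 2005, 2010],
    }
)

one_column = df["ocena"]                # a single label returns a Series
same_column = df.ocena                  # attribute access returns the same Series
two_columns = df[["naslov", "ocena"]]   # a list of labels returns a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```

The attribute form only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method; the bracket form always works.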

We access the row at position i with .iloc[i], and the row with key k with .loc[k].

filmi.iloc[120]
naslov                                                 Rebecca
dolzina                                                    130
leto                                                      1940
ocena                                                      8.1
metascore                                                 86.0
glasovi                                                 137358
zasluzek                                             4360000.0
oznaka                                                Approved
opis         A self-conscious woman juggles adjusting to he...
Name: 32976, dtype: object
filmi.loc[97576]
naslov                      Indiana Jones and the Last Crusade
dolzina                                                    127
leto                                                      1989
ocena                                                      8.2
metascore                                                 65.0
glasovi                                                 750654
zasluzek                                           197171806.0
oznaka                                                   PG-13
opis         In 1938, after his father Professor Henry Jone...
Name: 97576, dtype: object
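The distinction between position and key matters whenever the index is not simply 0, 1, 2, and so on. A minimal sketch on a made-up table (the data and the name `df` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy table whose index labels are deliberately not in positional order
df = pd.DataFrame(
    {"naslov": ["A", "B", "C"]},
    index=pd.Index([30, 10, 20], name="id"),
)

by_position = df.iloc[0]   # the first row, whatever its label is
by_label = df.loc[10]      # the row whose index label equals 10

print(by_position["naslov"], by_position.name)
print(by_label["naslov"], by_label.name)
```

Each result is a Series whose name is the index label of the selected row, which is why the outputs above end with a line such as Name: 97576.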

Filtering

We select particular rows of a table by indexing it with a column of Boolean values, which we obtain with ordinary comparison operations. The returned table keeps the rows for which the value in that column is True.

filmi.ocena >= 8
id
4972        False
6864        False
9968        False
10323        True
12349        True
            ...  
18568902     True
18689424    False
18968540    False
20850406     True
21279138    False
Name: ocena, Length: 9999, dtype: bool
filmi[filmi.ocena >= 9.3]
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
111161 The Shawshank Redemption 142 1994 9.3 81.0 2651625 28341469.0 R Two imprisoned men bond over a number of years...
15327088 Kantara 148 2022 9.4 NaN 33294 NaN NaN It involves culture of Kambla and Bhootha Kola...
filmi[(filmi.leto > 2010) & (filmi.ocena > 8) | (filmi.ocena < 5)]
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
52077 Plan 9 from Outer Space 79 1957 3.9 56.0 38744 NaN Not Rated Evil aliens attack Earth and set their terribl...
54673 The Beast of Yucca Flats 54 1961 1.8 NaN 11242 NaN Unrated A defecting Soviet scientist is hit by a nucle...
58548 Santa Claus Conquers the Martians 81 1964 2.6 NaN 11838 NaN Not Rated The Martians kidnap Santa Claus because there ...
59464 Monster a Go-Go 68 1965 1.7 NaN 11138 NaN TV-PG A space capsule crash-lands on Earth, and the ...
60666 Manos: The Hands of Fate 70 1966 1.6 NaN 36445 NaN Not Rated A family gets lost on the road and stumbles up...
... ... ... ... ... ... ... ... ... ...
15654262 Chup 135 2022 8.4 NaN 13098 NaN NaN A psychopath killer, targeting film critics. T...
16492678 Demon Slayer: Kimetsu no Yaiba - Tsuzumi Mansi... 87 2021 9.0 NaN 12634 NaN NaN Tanjiro ventures to the south-southeast where ...
18568902 Kaun Pravin Tambe? 134 2022 8.4 NaN 10163 NaN NaN An indian cricketer who shows persistence and ...
20850406 Sita Ramam 163 2022 8.5 NaN 38490 NaN NaN An orphan soldier, Lieutenant Ram's life chang...
21279138 Maid in Malacañang 114 2022 3.9 NaN 15273 NaN NaN The Last Days of Ferdinand and Imelda Marcos t...

695 rows × 9 columns
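When conditions are combined, & binds more tightly than |, so the last filter above reads as "(recent and highly rated) or poorly rated". The following sketch, on made-up data, shows how explicit parentheses change the result; the names `df`, `mask_a`, and `mask_b` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame(
    {"leto": [2005, 2015, 2015, 1990], "ocena": [4.0, 8.5, 7.0, 9.0]}
)

# Without extra parentheses: (leto > 2010 AND ocena > 8) OR ocena < 5
mask_a = (df.leto > 2010) & (df.ocena > 8) | (df.ocena < 5)

# With explicit grouping: leto > 2010 AND (ocena > 8 OR ocena < 5)
mask_b = (df.leto > 2010) & ((df.ocena > 8) | (df.ocena < 5))

print(df[mask_a])
print(df[mask_b])
```

The parentheses around each individual comparison are required, because & and | also bind more tightly than comparison operators such as > and <.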

Exercise

Find the films we want to avoid at all costs, that is, those that are longer than two hours and have a rating below 4. The solution below additionally keeps only films with more than 50,000 votes.

filmi[(filmi.dolzina > 120) & (filmi.ocena < 4) & (filmi.glasovi > 50000)]
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
118688 Batman & Robin 125 1997 3.7 28.0 253972 107325195.0 PG-13 Batman and Robin try to keep their relationshi...
120179 Speed 2: Cruise Control 121 1997 3.9 23.0 81714 48608066.0 PG-13 A computer hacker breaks into the computer sys...
2574698 Gunday 152 2014 2.6 NaN 59270 NaN Not Rated The lives of Calcutta's most powerful Gunday, ...
7886848 Sadak 2 133 2020 1.1 NaN 95865 NaN TV-MA The film picks up where Sadak left off, revolv...
10350922 Laxmii 141 2020 2.6 NaN 57411 NaN TV-MA Aasif visits his wife's parents' house and hap...
10888594 Radhe 135 2021 1.9 NaN 177814 NaN TV-MA After taking the dreaded gangster Gani Bhai, A...

Sorting

We sort a table with the .sort_values method, passing it the name or a list of names of the columns to sort by. Optionally, the ascending argument specifies which columns are sorted in ascending and which in descending order.

filmi.sort_values('dolzina')
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
2061702 To the Forest of Firefly Lights 45 2011 7.8 NaN 18535 NaN NaN Hotaru is rescued by a spirit when she gets lo...
15324 Sherlock Jr. 45 1924 8.2 NaN 50180 977375.0 Passed A film projectionist longs to be a detective, ...
2591814 The Garden of Words 46 2013 7.4 NaN 44624 NaN TV-14 A 15-year-old boy and 27-year-old woman find a...
275230 Blood: The Last Vampire 48 2000 6.6 44.0 12761 NaN Not Rated Saya is a Japanese vampire slayer whose next m...
142236 Dragon Ball Z: Revival Fusion 51 1995 7.6 NaN 11050 NaN PG The universe is thrown into dimensional chaos ...
... ... ... ... ... ... ... ... ... ...
107007 Gettysburg 271 1993 7.6 NaN 29479 10769960.0 PG In 1863, the Northern and Southern forces figh...
74084 1900 317 1976 7.7 70.0 25679 NaN Unrated The epic tale of a class struggle in twentieth...
1954470 Gangs of Wasseypur 321 2012 8.2 89.0 96141 NaN Not Rated A clash between Sultan and Shahid Khan leads t...
346336 The Best of Youth 366 2003 8.5 89.0 22119 274024.0 R An Italian epic that follows the lives of two ...
111341 Satantango 439 1994 8.3 NaN 11214 NaN Not Rated On the eve of a large payment, residents of a ...

9999 rows × 9 columns

# first sort by rating in descending order, then within each rating by year in ascending order
filmi.sort_values(['ocena', 'leto'], ascending=[False, True])
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
15327088 Kantara 148 2022 9.4 NaN 33294 NaN NaN It involves culture of Kambla and Bhootha Kola...
111161 The Shawshank Redemption 142 1994 9.3 81.0 2651625 28341469.0 R Two imprisoned men bond over a number of years...
68646 The Godfather 175 1972 9.2 100.0 1838099 134966411.0 R The aging patriarch of an organized crime dyna...
252487 The Chaos Class 87 1975 9.2 NaN 40747 NaN NaN Lazy, uneducated students share a very close b...
50083 12 Angry Men 96 1957 9.0 96.0 782923 4360000.0 Approved The jury in a New York City murder trial is fr...
... ... ... ... ... ... ... ... ... ...
421051 Daniel the Wizard 81 2004 1.2 NaN 14413 NaN Not Rated Evil assassins want to kill Daniel Kublbock, t...
6038600 Smolensk 120 2016 1.2 NaN 39704 NaN NaN An inspired story of people affected by the tr...
7886848 Sadak 2 133 2020 1.1 NaN 95865 NaN TV-MA The film picks up where Sadak left off, revolv...
5988370 Reis 108 2017 1.0 NaN 73382 NaN NaN A drama about the early life of Recep Tayyip E...
7221896 Cumali Ceber 100 2017 1.0 NaN 38958 NaN NaN Cumali Ceber goes to a vacation with his child...

9999 rows × 9 columns
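The effect of a mixed ascending list is easiest to check on a handful of rows. This sketch uses made-up data; the names `df` and `ordered` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with ties in the rating column
df = pd.DataFrame(
    {
        "naslov": ["A", "B", "C", "D"],
        "ocena": [8.0, 9.0, 8.0, 9.0],
        "leto": [2001, 1995, 1980, 1972],
    }
)

# Rating descending; ties broken by year ascending
ordered = df.sort_values(["ocena", "leto"], ascending=[False, True])

print(ordered)
```

Note that .sort_values returns a new table and leaves the original unchanged, so the result has to be stored or used directly.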

Grouping

The .groupby method creates a special kind of table in which the rows are grouped by a shared property.

filmi_po_letih = filmi.groupby('leto')
filmi_po_letih
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb7d8423ee0>
# average rating for each year
filmi_po_letih.ocena.mean()
leto
1915    6.200000
1916    7.700000
1919    7.300000
1920    8.000000
1921    8.150000
          ...   
2018    6.430748
2019    6.493051
2020    6.144304
2021    6.369742
2022    6.361628
Name: ocena, Length: 106, dtype: float64

If we wish, we can also group by computed properties. Let us compute a column and store it in the table.

filmi['desetletje'] = 10 * (filmi.leto // 10)
filmi
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis desetletje
id
4972 The Birth of a Nation 195 1915 6.2 NaN 24890 10000000.0 TV-PG The Stoneman family finds its friendship with ... 1910
6864 Intolerance 197 1916 7.7 99.0 15670 2180000.0 Passed The story of a poor young woman separated by p... 1910
9968 Broken Blossoms 90 1919 7.3 NaN 10423 NaN Not Rated A frail waif, abused by her brutal boxer fathe... 1910
10323 The Cabinet of Dr. Caligari 67 1920 8.0 NaN 64133 NaN Not Rated Hypnotist Dr. Caligari uses a somnambulist, Ce... 1920
12349 The Kid 68 1921 8.3 NaN 126513 5450000.0 Passed The Tramp cares for an abandoned child, but ev... 1920
... ... ... ... ... ... ... ... ... ... ...
18568902 Kaun Pravin Tambe? 134 2022 8.4 NaN 10163 NaN NaN An indian cricketer who shows persistence and ... 2020
18689424 Batman v Superman: Dawn of Justice - Ultimate ... 182 2016 7.1 NaN 57662 NaN R Batman is manipulated by Lex Luthor to fear Su... 2010
18968540 Incantation 110 2022 6.2 NaN 12366 NaN TV-MA Six years ago, Li Ronan was cursed after break... 2020
20850406 Sita Ramam 163 2022 8.5 NaN 38490 NaN NaN An orphan soldier, Lieutenant Ram's life chang... 2020
21279138 Maid in Malacañang 114 2022 3.9 NaN 15273 NaN NaN The Last Days of Ferdinand and Imelda Marcos t... 2020

9999 rows × 10 columns
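The decade is computed with integer division: dividing by 10 without remainder drops the last digit, and multiplying by 10 puts a zero back in its place. A minimal sketch on a few made-up years (the names `leta` and `desetletja` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample of years
leta = pd.Series([1915, 1920, 1999, 2000, 2022])

# // is integer (floor) division and is applied to every element of the Series
desetletja = 10 * (leta // 10)

print(desetletja.tolist())
```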

filmi_po_desetletjih = filmi.groupby('desetletje')

Let us count how many films there were in each decade. The .count() method counts the non-missing values, so most columns show the same numbers, because they have the same number of entries. Where some data is missing, the number is smaller.

filmi_po_desetletjih.count()
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
desetletje
1910 3 3 3 3 1 3 2 3 3
1920 27 27 27 27 4 27 18 27 27
1930 80 80 80 80 39 80 36 80 80
1940 134 134 134 134 63 134 46 133 134
1950 205 205 205 205 113 205 92 205 205
1960 284 284 284 284 172 284 150 281 284
1970 410 410 410 410 323 410 276 394 410
1980 823 823 823 823 721 823 711 809 823
1990 1420 1420 1420 1420 1128 1420 1324 1399 1420
2000 2575 2575 2575 2575 2183 2575 2203 2507 2575
2010 3358 3358 3358 3358 2728 3358 2353 3228 3358
2020 680 680 680 680 500 680 47 586 680

If we only want the number of members in each group, we use the .size() method. In this case we get just a column, not a table.

filmi_po_desetletjih.size()
desetletje
1910       3
1920      27
1930      80
1940     134
1950     205
1960     284
1970     410
1980     823
1990    1420
2000    2575
2010    3358
2020     680
dtype: int64
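The difference between .count() and .size() shows up only when values are missing. The following sketch uses a made-up table with one missing metascore; all names and data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; one metascore is missing
df = pd.DataFrame(
    {
        "desetletje": [1990, 1990, 2000],
        "metascore": [80.0, float("nan"), 65.0],
        "ocena": [7.0, 6.5, 8.0],
    }
)

groups = df.groupby("desetletje")

counts = groups.count()   # DataFrame: non-missing values per column and group
sizes = groups.size()     # Series: number of rows per group, missing or not

print(counts)
print(sizes)
```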

Let us look at the averages for each decade. We get the average year, length, rating, and earnings. An average title cannot be computed, so we pass numeric_only=True to restrict the calculation to numeric columns; without this argument, recent versions of Pandas raise a TypeError instead of silently dropping the text columns.

filmi_po_desetletjih.mean(numeric_only=True)
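How averaging a grouped table interacts with text columns can be tried out on a small made-up table. This is a sketch under stated assumptions: the data and the names `df` and `means` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column
df = pd.DataFrame(
    {
        "desetletje": [1990, 1990, 2000],
        "naslov": ["A", "B", "C"],
        "ocena": [7.0, 8.0, 6.0],
    }
)

# numeric_only=True skips columns, such as naslov, that have no meaningful mean
means = df.groupby("desetletje").mean(numeric_only=True)

print(means)
```

An alternative is to select the numeric columns before aggregating, for example df.groupby("desetletje")[["ocena"]].mean().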

Exercise

Compute the number of films of each length, with lengths rounded to 5 minutes.

Plotting

An ordinary line plot is produced by the plot method. We use it when we want to show how a value changes as a function of a continuous variable. Our hypothesis is that the golden age of film is over. The plot refutes it.

filmi[filmi.ocena > 9].groupby('desetletje').size().plot()

A scatter plot is produced by the plot.scatter method. We use it when we want to examine the relationship between two variables.

filmi.plot.scatter('ocena', 'metascore')
filmi[filmi.dolzina < 250].plot.scatter('dolzina', 'ocena')

A bar chart is produced by the plot.bar method. We use it when we want to compare values across discrete (usually categorical) variables. It is often useful to sort the chart by value.

filmi.sort_values('zasluzek', ascending=False).head(20).plot.bar(x='naslov', y='zasluzek')

Exercise

Draw plots that appropriately show:

  • The relationship between the IMDb rating and the metascore

  • How the average film length changes over the years

Joining

osebe = pd.read_csv('podatki/osebe.csv', index_col='id')
vloge = pd.read_csv('podatki/vloge.csv')
zanri = pd.read_csv('podatki/zanri.csv')

We join tables with the merge function, which returns a table of the entries from both tables in which all identically named fields match.

vloge[vloge.film == 12349]
zanri[zanri.film == 12349]
pd.merge(vloge, zanri).head(20)

By default, the joined table contains only the entries that appear in both tables. This is called an inner join. Using the how argument, we can instead choose to also keep the entries that have data only in the left table (left join), only in the right table (right join), or in at least one of the tables (outer join). Where one table has no matching entry, the joined table contains missing values. Since our data comes from IMDb, where every film has both genres and roles assigned, the choice makes no difference in our case.

Sometimes we also want to join on columns with different names. In that case we pass the left_on and right_on arguments to merge.
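Because the join types give identical results on the film data, the difference is easier to see on two small made-up tables that only partly overlap. The tables `left` and `right` and their contents are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy tables; films 2 and 3 appear in both
left = pd.DataFrame({"film": [1, 2, 3], "oseba": [10, 20, 30]})
right = pd.DataFrame({"film": [2, 3, 4], "zanr": ["Drama", "Comedy", "Horror"]})

# The shared column name "film" is used as the join key
inner = pd.merge(left, right)                 # default: how="inner"
left_join = pd.merge(left, right, how="left")
outer = pd.merge(left, right, how="outer")

print(len(inner), len(left_join), len(outer))
print(left_join)
```

In the left join, film 1 is kept with a missing zanr; in the outer join, film 4 is also kept, with a missing oseba.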

pd.merge(pd.merge(vloge, zanri), osebe, left_on='oseba', right_on='id')
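A smaller version of the same join may make the roles of left_on and right_on clearer. This is a sketch on made-up data: the two tables below only borrow the names vloge and osebe, and here id is an ordinary column rather than the index.

```python
import pandas as pd

# Hypothetical toy tables; the key is called "oseba" on the left and "id" on the right
vloge = pd.DataFrame({"film": [1, 1, 2], "oseba": [10, 20, 10]})
osebe = pd.DataFrame({"id": [10, 20], "ime": ["Ana", "Bor"]})

# Match rows where vloge.oseba equals osebe.id
merged = pd.merge(vloge, osebe, left_on="oseba", right_on="id")

print(merged)
```

Both key columns are kept in the result, so the joined table contains oseba as well as id.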

Let us see which people have acted in the most comedies.

zanri_oseb = pd.merge(pd.merge(vloge, zanri), osebe, left_on='oseba', right_on='id')
zanri_oseb[
    (zanri_oseb.zanr == 'Comedy') &
    (zanri_oseb.vloga == 'I')
].groupby(
    'ime'
).size(
).sort_values(
    ascending=False
).head(20)

Exercise

  • Compute the average rating of each genre.

  • Which directors make the most profitable films?