The Pandas library

Below is an overview of the most basic methods offered by the Pandas library. Each of them has many additional options, which are described in detail in the official documentation. When reading the documentation, the best place to start is with the introductory guides.

Setup

# load the package
import pandas as pd

# load the table we will be working with
filmi = pd.read_csv('podatki/filmi.csv', index_col='id')

# we will be working with large tables, so display at most 20 rows at a time
pd.options.display.max_rows = 20

Basic selection of table elements

The .head(n) method shows the first n rows of the table and .tail(n) the last n rows; in both cases n defaults to 5.

filmi.head(10)
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
4972 The Birth of a Nation 195 1915 6.2 NaN 24890 10000000.0 TV-PG The Stoneman family finds its friendship with ...
6864 Intolerance 197 1916 7.7 99.0 15670 2180000.0 Passed The story of a poor young woman separated by p...
9968 Broken Blossoms 90 1919 7.3 NaN 10423 NaN Not Rated A frail waif, abused by her brutal boxer fathe...
10323 The Cabinet of Dr. Caligari 67 1920 8.0 NaN 64133 NaN Not Rated Hypnotist Dr. Caligari uses a somnambulist, Ce...
12349 The Kid 68 1921 8.3 NaN 126513 5450000.0 Passed The Tramp cares for an abandoned child, but ev...
12364 The Phantom Carriage 107 1921 8.0 NaN 12624 NaN Not Rated On New Year's Eve, the driver of a ghostly car...
13442 Nosferatu 94 1922 7.9 NaN 97589 NaN Not Rated Vampire Count Orlok expresses interest in a ne...
14341 Our Hospitality 65 1923 7.8 NaN 11428 1172499.0 Passed A man returns to his Appalachian homestead. On...
14429 Safety Last! 74 1923 8.1 NaN 20887 1359903.0 Not Rated A boy leaves his small country town and heads ...
15064 The Last Laugh 90 1924 8.0 NaN 14150 94812.0 Not Rated An aging doorman is forced to face the scorn o...
filmi.tail()
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
18568902 Kaun Pravin Tambe? 134 2022 8.4 NaN 10163 NaN NaN An indian cricketer who shows persistence and ...
18689424 Batman v Superman: Dawn of Justice - Ultimate ... 182 2016 7.1 NaN 57662 NaN R Batman is manipulated by Lex Luthor to fear Su...
18968540 Incantation 110 2022 6.2 NaN 12366 NaN TV-MA Six years ago, Li Ronan was cursed after break...
20850406 Sita Ramam 163 2022 8.5 NaN 38490 NaN NaN An orphan soldier, Lieutenant Ram's life chang...
21279138 Maid in Malacañang 114 2022 3.9 NaN 15273 NaN NaN The Last Days of Ferdinand and Imelda Marcos t...

Slices select rows by position. The slice 3:10:2 below takes every second row from position 3 up to, but not including, position 10.

filmi[3:10:2]
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
10323 The Cabinet of Dr. Caligari 67 1920 8.0 NaN 64133 NaN Not Rated Hypnotist Dr. Caligari uses a somnambulist, Ce...
12364 The Phantom Carriage 107 1921 8.0 NaN 12624 NaN Not Rated On New Year's Eve, the driver of a ghostly car...
14341 Our Hospitality 65 1923 7.8 NaN 11428 1172499.0 Passed A man returns to his Appalachian homestead. On...
15064 The Last Laugh 90 1924 8.0 NaN 14150 94812.0 Not Rated An aging doorman is forced to face the scorn o...
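The positional nature of slicing can be checked on a small table. The sketch below uses a hypothetical table named table (not part of the original data) whose index values deliberately differ from the row positions, and compares the plain slice with the equivalent .iloc slice.

```python
import pandas as pd

# Hypothetical table: 12 rows, index labels 100..111 so that labels != positions
table = pd.DataFrame(
    {"x": list(range(12))},
    index=pd.Index(range(100, 112), name="id"),
)

# Rows at positions 3, 5, 7 and 9, regardless of the index labels
sliced = table[3:10:2]
print(sliced.index.tolist())

# The same selection written explicitly as a positional slice
same = sliced.equals(table.iloc[3:10:2])
print(same)
```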

Indexing the table with a column name returns that column.

filmi['ocena']
id
4972        6.2
6864        7.7
9968        7.3
10323       8.0
12349       8.3
           ... 
18568902    8.4
18689424    7.1
18968540    6.2
20850406    8.5
21279138    3.9
Name: ocena, Length: 9999, dtype: float64

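Because columns are accessed so often, there is also a shorter attribute notation. It works only when the column name is a valid Python identifier that does not clash with an existing attribute or method of the table.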

filmi.ocena
id
4972        6.2
6864        7.7
9968        7.3
10323       8.0
12349       8.3
           ... 
18568902    8.4
18689424    7.1
18968540    6.2
20850406    8.5
21279138    3.9
Name: ocena, Length: 9999, dtype: float64
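One case where the short notation does not return a column is a column whose name matches an existing attribute. The sketch below assumes a hypothetical table sizes with a column called size, which is also the name of the attribute holding the number of cells in a table.

```python
import pandas as pd

# Hypothetical table; the column name "size" clashes with DataFrame.size
sizes = pd.DataFrame({"size": [10, 20, 30], "ocena": [6.2, 7.7, 7.3]})

# For an ordinary column both notations return the same Series
same_column = sizes.ocena.equals(sizes["ocena"])

# Attribute access returns the number of cells (3 rows x 2 columns)
cell_count = sizes.size

# Bracket notation still returns the column itself
size_column = sizes["size"].tolist()

print(same_column, cell_count, size_column)
```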

To select several columns, index the table with a list of their labels.

filmi[['naslov', 'ocena']]
id        naslov                                             ocena
4972      The Birth of a Nation                                6.2
6864      Intolerance                                          7.7
9968      Broken Blossoms                                      7.3
10323     The Cabinet of Dr. Caligari                          8.0
12349     The Kid                                              8.3
...       ...                                                  ...
18568902  Kaun Pravin Tambe?                                   8.4
18689424  Batman v Superman: Dawn of Justice - Ultimate ...    7.1
18968540  Incantation                                          6.2
20850406  Sita Ramam                                           8.5
21279138  Maid in Malacañang                                   3.9

9999 rows × 2 columns
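The type of the result depends on whether the index is a single label or a list. A minimal sketch, assuming a hypothetical three-row table filmi_mini standing in for the real data:

```python
import pandas as pd

# Hypothetical miniature stand-in for the filmi table
filmi_mini = pd.DataFrame(
    {
        "naslov": ["The Kid", "Nosferatu", "Rebecca"],
        "leto": [1921, 1922, 1940],
        "ocena": [8.3, 7.9, 8.1],
    },
    index=pd.Index([12349, 13442, 32976], name="id"),
)

column = filmi_mini["ocena"]       # a single label gives a Series
one_col = filmi_mini[["ocena"]]    # a one-element list gives a DataFrame
two_cols = filmi_mini[["naslov", "ocena"]]

print(type(column).__name__, type(one_col).__name__)
print(one_col.shape, two_cols.shape)
```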

The row at position i is accessed with .iloc[i], and the row with index key k with .loc[k].

filmi.iloc[120]
naslov                                                 Rebecca
dolzina                                                    130
leto                                                      1940
ocena                                                      8.1
metascore                                                 86.0
glasovi                                                 137358
zasluzek                                             4360000.0
oznaka                                                Approved
opis         A self-conscious woman juggles adjusting to he...
Name: 32976, dtype: object
filmi.loc[97576]
naslov                      Indiana Jones and the Last Crusade
dolzina                                                    127
leto                                                      1989
ocena                                                      8.2
metascore                                                 65.0
glasovi                                                 750654
zasluzek                                           197171806.0
oznaka                                                   PG-13
opis         In 1938, after his father Professor Henry Jone...
Name: 97576, dtype: object
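The difference between position and key matters whenever the index is not simply 0, 1, 2, and so on. The sketch below uses a hypothetical miniature table filmi_mini indexed by film id, as in the real data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the filmi table, indexed by id
filmi_mini = pd.DataFrame(
    {"naslov": ["The Kid", "Nosferatu", "Rebecca"], "ocena": [8.3, 7.9, 8.1]},
    index=pd.Index([12349, 13442, 32976], name="id"),
)

by_position = filmi_mini.iloc[2]   # third row, counting from 0
by_key = filmi_mini.loc[32976]     # row whose id is 32976
same_row = by_position.equals(by_key)
print(same_row)

# .loc looks up keys, so a position that is not an id raises KeyError
try:
    filmi_mini.loc[2]
    missing_key = False
except KeyError:
    missing_key = True
print(missing_key)
```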

Filtering

To select particular rows, we index the table with a column of Boolean values, which we obtain with the usual comparison operations. The returned table keeps exactly those rows for which the value in that column is True.

filmi.ocena >= 8
id
4972        False
6864        False
9968        False
10323        True
12349        True
            ...  
18568902     True
18689424    False
18968540    False
20850406     True
21279138    False
Name: ocena, Length: 9999, dtype: bool
filmi[filmi.ocena >= 9.3]
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
111161 The Shawshank Redemption 142 1994 9.3 81.0 2651625 28341469.0 R Two imprisoned men bond over a number of years...
15327088 Kantara 148 2022 9.4 NaN 33294 NaN NaN It involves culture of Kambla and Bhootha Kola...
filmi[(filmi.leto > 2010) & (filmi.ocena > 8) | (filmi.ocena < 5)]
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
52077 Plan 9 from Outer Space 79 1957 3.9 56.0 38744 NaN Not Rated Evil aliens attack Earth and set their terribl...
54673 The Beast of Yucca Flats 54 1961 1.8 NaN 11242 NaN Unrated A defecting Soviet scientist is hit by a nucle...
58548 Santa Claus Conquers the Martians 81 1964 2.6 NaN 11838 NaN Not Rated The Martians kidnap Santa Claus because there ...
59464 Monster a Go-Go 68 1965 1.7 NaN 11138 NaN TV-PG A space capsule crash-lands on Earth, and the ...
60666 Manos: The Hands of Fate 70 1966 1.6 NaN 36445 NaN Not Rated A family gets lost on the road and stumbles up...
... ... ... ... ... ... ... ... ... ...
15654262 Chup 135 2022 8.4 NaN 13098 NaN NaN A psychopath killer, targeting film critics. T...
16492678 Demon Slayer: Kimetsu no Yaiba - Tsuzumi Mansi... 87 2021 9.0 NaN 12634 NaN NaN Tanjiro ventures to the south-southeast where ...
18568902 Kaun Pravin Tambe? 134 2022 8.4 NaN 10163 NaN NaN An indian cricketer who shows persistence and ...
20850406 Sita Ramam 163 2022 8.5 NaN 38490 NaN NaN An orphan soldier, Lieutenant Ram's life chang...
21279138 Maid in Malacañang 114 2022 3.9 NaN 15273 NaN NaN The Last Days of Ferdinand and Imelda Marcos t...

695 rows × 9 columns
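Conditions are combined with & (and), | (or) and ~ (not), and each comparison needs its own parentheses. Because & binds more tightly than |, the last filter above reads as "(recent and good) or bad". A small sketch on a hypothetical four-row table filmi_mini:

```python
import pandas as pd

# Hypothetical miniature stand-in for the filmi table
filmi_mini = pd.DataFrame(
    {
        "naslov": ["A", "B", "C", "D"],
        "leto": [1957, 2012, 2015, 2021],
        "ocena": [3.9, 8.2, 6.0, 4.5],
    }
)

recent_and_good = (filmi_mini.leto > 2010) & (filmi_mini.ocena > 8)
bad = filmi_mini.ocena < 5

# Without extra parentheses, & is evaluated before |
mask = (filmi_mini.leto > 2010) & (filmi_mini.ocena > 8) | (filmi_mini.ocena < 5)
same_mask = mask.equals(recent_and_good | bad)

kept = filmi_mini[mask].naslov.tolist()      # rows where the mask is True
dropped = filmi_mini[~mask].naslov.tolist()  # ~ negates the mask
print(same_mask, kept, dropped)
```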

Exercise

Find the films we want to avoid at all costs: those that are longer than two hours, have a rating below 4, and have received more than 50,000 votes.

filmi[(filmi.dolzina > 120) & (filmi.ocena < 4) & (filmi.glasovi > 50000)]
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
118688 Batman & Robin 125 1997 3.7 28.0 253972 107325195.0 PG-13 Batman and Robin try to keep their relationshi...
120179 Speed 2: Cruise Control 121 1997 3.9 23.0 81714 48608066.0 PG-13 A computer hacker breaks into the computer sys...
2574698 Gunday 152 2014 2.6 NaN 59270 NaN Not Rated The lives of Calcutta's most powerful Gunday, ...
7886848 Sadak 2 133 2020 1.1 NaN 95865 NaN TV-MA The film picks up where Sadak left off, revolv...
10350922 Laxmii 141 2020 2.6 NaN 57411 NaN TV-MA Aasif visits his wife's parents' house and hap...
10888594 Radhe 135 2021 1.9 NaN 177814 NaN TV-MA After taking the dreaded gangster Gani Bhai, A...

Sorting

We sort a table with the .sort_values method, passing it the name of a column, or a list of column names, to sort by. Optionally, the ascending argument specifies which columns are sorted in ascending and which in descending order.

filmi.sort_values('dolzina')
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
2061702 To the Forest of Firefly Lights 45 2011 7.8 NaN 18535 NaN NaN Hotaru is rescued by a spirit when she gets lo...
15324 Sherlock Jr. 45 1924 8.2 NaN 50180 977375.0 Passed A film projectionist longs to be a detective, ...
2591814 The Garden of Words 46 2013 7.4 NaN 44624 NaN TV-14 A 15-year-old boy and 27-year-old woman find a...
275230 Blood: The Last Vampire 48 2000 6.6 44.0 12761 NaN Not Rated Saya is a Japanese vampire slayer whose next m...
142236 Dragon Ball Z: Revival Fusion 51 1995 7.6 NaN 11050 NaN PG The universe is thrown into dimensional chaos ...
... ... ... ... ... ... ... ... ... ...
107007 Gettysburg 271 1993 7.6 NaN 29479 10769960.0 PG In 1863, the Northern and Southern forces figh...
74084 1900 317 1976 7.7 70.0 25679 NaN Unrated The epic tale of a class struggle in twentieth...
1954470 Gangs of Wasseypur 321 2012 8.2 89.0 96141 NaN Not Rated A clash between Sultan and Shahid Khan leads t...
346336 The Best of Youth 366 2003 8.5 89.0 22119 274024.0 R An Italian epic that follows the lives of two ...
111341 Satantango 439 1994 8.3 NaN 11214 NaN Not Rated On the eve of a large payment, residents of a ...

9999 rows × 9 columns

# sort by rating in descending order first, then by year in ascending order within each rating
filmi.sort_values(['ocena', 'leto'], ascending=[False, True])
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis
id
15327088 Kantara 148 2022 9.4 NaN 33294 NaN NaN It involves culture of Kambla and Bhootha Kola...
111161 The Shawshank Redemption 142 1994 9.3 81.0 2651625 28341469.0 R Two imprisoned men bond over a number of years...
68646 The Godfather 175 1972 9.2 100.0 1838099 134966411.0 R The aging patriarch of an organized crime dyna...
252487 The Chaos Class 87 1975 9.2 NaN 40747 NaN NaN Lazy, uneducated students share a very close b...
50083 12 Angry Men 96 1957 9.0 96.0 782923 4360000.0 Approved The jury in a New York City murder trial is fr...
... ... ... ... ... ... ... ... ... ...
421051 Daniel the Wizard 81 2004 1.2 NaN 14413 NaN Not Rated Evil assassins want to kill Daniel Kublbock, t...
6038600 Smolensk 120 2016 1.2 NaN 39704 NaN NaN An inspired story of people affected by the tr...
7886848 Sadak 2 133 2020 1.1 NaN 95865 NaN TV-MA The film picks up where Sadak left off, revolv...
5988370 Reis 108 2017 1.0 NaN 73382 NaN NaN A drama about the early life of Recep Tayyip E...
7221896 Cumali Ceber 100 2017 1.0 NaN 38958 NaN NaN Cumali Ceber goes to a vacation with his child...

9999 rows × 9 columns
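The effect of a per-column ascending list is easiest to see when two rows tie on the first column. The sketch below uses a hypothetical four-row table filmi_mini; note that .sort_values returns a new table and leaves the original untouched.

```python
import pandas as pd

# Hypothetical miniature table; A and B tie on the rating
filmi_mini = pd.DataFrame(
    {
        "naslov": ["A", "B", "C", "D"],
        "leto": [1975, 1972, 1994, 1957],
        "ocena": [9.2, 9.2, 9.3, 9.0],
    }
)

# Rating descending; ties are broken by year ascending
ordered = filmi_mini.sort_values(["ocena", "leto"], ascending=[False, True])
sorted_titles = ordered.naslov.tolist()
original_titles = filmi_mini.naslov.tolist()
print(sorted_titles, original_titles)
```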

Grouping

The .groupby method creates a special kind of object in which the rows are grouped by a shared property.

filmi_po_letih = filmi.groupby('leto')
filmi_po_letih
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f5ab85bc5b0>
# average rating for each year
filmi_po_letih.ocena.mean()
leto
1915    6.200000
1916    7.700000
1919    7.300000
1920    8.000000
1921    8.150000
          ...   
2018    6.430748
2019    6.493051
2020    6.144304
2021    6.369742
2022    6.361628
Name: ocena, Length: 106, dtype: float64

If we wish, we can also group by computed properties. Let us compute a new column and store it in the table.

filmi['desetletje'] = 10 * (filmi.leto // 10)
filmi
naslov dolzina leto ocena metascore glasovi zasluzek oznaka opis desetletje
id
4972 The Birth of a Nation 195 1915 6.2 NaN 24890 10000000.0 TV-PG The Stoneman family finds its friendship with ... 1910
6864 Intolerance 197 1916 7.7 99.0 15670 2180000.0 Passed The story of a poor young woman separated by p... 1910
9968 Broken Blossoms 90 1919 7.3 NaN 10423 NaN Not Rated A frail waif, abused by her brutal boxer fathe... 1910
10323 The Cabinet of Dr. Caligari 67 1920 8.0 NaN 64133 NaN Not Rated Hypnotist Dr. Caligari uses a somnambulist, Ce... 1920
12349 The Kid 68 1921 8.3 NaN 126513 5450000.0 Passed The Tramp cares for an abandoned child, but ev... 1920
... ... ... ... ... ... ... ... ... ... ...
18568902 Kaun Pravin Tambe? 134 2022 8.4 NaN 10163 NaN NaN An indian cricketer who shows persistence and ... 2020
18689424 Batman v Superman: Dawn of Justice - Ultimate ... 182 2016 7.1 NaN 57662 NaN R Batman is manipulated by Lex Luthor to fear Su... 2010
18968540 Incantation 110 2022 6.2 NaN 12366 NaN TV-MA Six years ago, Li Ronan was cursed after break... 2020
20850406 Sita Ramam 163 2022 8.5 NaN 38490 NaN NaN An orphan soldier, Lieutenant Ram's life chang... 2020
21279138 Maid in Malacañang 114 2022 3.9 NaN 15273 NaN NaN The Last Days of Ferdinand and Imelda Marcos t... 2020

9999 rows × 10 columns

filmi_po_desetletjih = filmi.groupby('desetletje')
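Storing the computed column is optional: .groupby also accepts a computed Series directly. A minimal sketch on a hypothetical five-row table filmi_mini, where the name decade is introduced only for illustration:

```python
import pandas as pd

# Hypothetical miniature stand-in for the filmi table
filmi_mini = pd.DataFrame(
    {"leto": [1915, 1919, 1921, 2016, 2022], "ocena": [6.2, 7.3, 8.3, 7.1, 8.5]}
)

# Integer division drops the last digit of the year; multiplying restores the scale
decade = 10 * (filmi_mini.leto // 10)
decades = decade.tolist()

# Group by the computed Series without adding it to the table;
# rename() only sets the name shown on the resulting index
mean_by_decade = filmi_mini.groupby(decade.rename("desetletje")).ocena.mean()
print(decades)
print(mean_by_decade)
```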

Let us count how many films there were in each decade. Most columns show the same numbers because they have the same number of entries. Where some data is missing, the number is smaller.

filmi_po_desetletjih.count()
desetletje  naslov  dolzina  leto  ocena  metascore  glasovi  zasluzek  oznaka  opis
1910             3        3     3      3          1        3         2       3     3
1920            27       27    27     27          4       27        18      27    27
1930            80       80    80     80         39       80        36      80    80
1940           134      134   134    134         63      134        46     133   134
1950           205      205   205    205        113      205        92     205   205
1960           284      284   284    284        172      284       150     281   284
1970           410      410   410    410        323      410       276     394   410
1980           823      823   823    823        721      823       711     809   823
1990          1420     1420  1420   1420       1128     1420      1324    1399  1420
2000          2575     2575  2575   2575       2183     2575      2203    2507  2575
2010          3358     3358  3358   3358       2728     3358      2353    3228  3358
2020           680      680   680    680        500      680        47     586   680

If we only want the number of members of each group, we use the .size() method. In this case we get a single column rather than a table.

filmi_po_desetletjih.size()
desetletje
1910       3
1920      27
1930      80
1940     134
1950     205
1960     284
1970     410
1980     823
1990    1420
2000    2575
2010    3358
2020     680
dtype: int64
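The difference between .count() and .size() shows up only where values are missing. The sketch below assumes a hypothetical three-row table filmi_mini in which two metascore values are missing.

```python
import pandas as pd

# Hypothetical miniature table with missing metascore values
filmi_mini = pd.DataFrame(
    {
        "desetletje": [1910, 1910, 1920],
        "ocena": [6.2, 7.3, 8.0],
        "metascore": [float("nan"), 99.0, float("nan")],
    }
)
groups = filmi_mini.groupby("desetletje")

counts = groups.count()  # non-missing values per column and group (a table)
sizes = groups.size()    # number of rows per group (a single column)

print(counts)
print(sizes)
```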

Let us look at the averages for each decade. We get, among others, the average year, length, rating, and earnings. We get no average title, because one cannot be computed, so there is no such column. Since pandas 2.0, non-numeric columns are no longer dropped automatically and calling .mean() on a group that contains them raises a TypeError, so we pass numeric_only=True.

filmi_po_desetletjih.mean(numeric_only=True)
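Text columns such as naslov have no mean, so averaging has to be restricted to numeric columns. The sketch below shows two ways to do that on a hypothetical three-row table filmi_mini: the numeric_only argument, and selecting the columns of interest before aggregating.

```python
import pandas as pd

# Hypothetical miniature table with one text column
filmi_mini = pd.DataFrame(
    {
        "naslov": ["A", "B", "C"],
        "desetletje": [1910, 1910, 1920],
        "ocena": [6.0, 8.0, 8.0],
        "dolzina": [195, 197, 67],
    }
)
groups = filmi_mini.groupby("desetletje")

# Option 1: let pandas skip every non-numeric column
numeric_means = groups.mean(numeric_only=True)

# Option 2: name the columns to average explicitly
selected_means = groups[["ocena", "dolzina"]].mean()

print(numeric_means)
print(selected_means)
```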

Exercise

Compute the number of films of each length, rounded to 5 minutes.

Plotting

An ordinary line plot is produced with the plot method. We use it when we want to show how a value changes as a function of a continuous variable. Our hypothesis is that the golden years of film are over. The plot refutes it.

filmi[filmi.ocena > 9].groupby('desetletje').size().plot()

A scatter plot is produced with the plot.scatter method. We use it when we want to find a relationship between two variables.

filmi.plot.scatter('ocena', 'metascore')
filmi[filmi.dolzina < 250].plot.scatter('dolzina', 'ocena')

A bar chart is produced with the plot.bar method. We use it when we want to compare values of discrete (usually categorical) variables. It is often useful to sort the chart by value.

filmi.sort_values('zasluzek', ascending=False).head(20).plot.bar(x='naslov', y='zasluzek')

Exercise

Draw plots that suitably show:

  • The relationship between the IMDb rating and the metascore

  • How the average film length changes over the years

Joining

osebe = pd.read_csv('podatki/osebe.csv', index_col='id')
vloge = pd.read_csv('podatki/vloge.csv')
zanri = pd.read_csv('podatki/zanri.csv')

Tables are joined with the merge function, which returns a table of the entries from both tables for which the values in all identically named columns match.

vloge[vloge.film == 12349]
zanri[zanri.film == 12349]
pd.merge(vloge, zanri).head(20)

By default, the joined table contains only the entries that appear in both tables. This is called an inner join. We can instead choose to also keep the entries that have data only in the left table (left join), only in the right table (right join), or in at least one of the tables (outer join). Where one table has no matching entry, the joined table contains missing values. Since our data comes from IMDb, where both genres and roles are defined for every film, the join types give the same result in our case.
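The join types differ only when some key is missing from one of the tables. The sketch below uses two hypothetical miniature tables, vloge_mini and zanri_mini, in which film 2 has no genre and film 3 has no roles, and selects the join type with the how argument of merge.

```python
import pandas as pd

# Hypothetical tables: film 2 has no genre, film 3 has no roles
vloge_mini = pd.DataFrame({"film": [1, 1, 2], "oseba": [10, 11, 12]})
zanri_mini = pd.DataFrame({"film": [1, 3], "zanr": ["Comedy", "Drama"]})

inner = pd.merge(vloge_mini, zanri_mini)               # only film 1
left = pd.merge(vloge_mini, zanri_mini, how="left")    # films 1 and 2
right = pd.merge(vloge_mini, zanri_mini, how="right")  # films 1 and 3
outer = pd.merge(vloge_mini, zanri_mini, how="outer")  # films 1, 2 and 3

print(len(inner), len(left), len(right), len(outer))

# Rows without a partner are filled with missing values
missing_genres = int(left.zanr.isna().sum())
missing_people = int(outer.oseba.isna().sum())
print(missing_genres, missing_people)
```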

Sometimes we want to join on columns with different names. In that case we pass the left_on and right_on arguments to merge.

pd.merge(pd.merge(vloge, zanri), osebe, left_on='oseba', right_on='id')
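Here right_on='id' refers to the index of osebe, since that table was loaded with index_col='id'; merge accepts index level names as well as column names. A minimal sketch with hypothetical miniature tables vloge_mini and osebe_mini and made-up names:

```python
import pandas as pd

# Hypothetical tables; osebe_mini is indexed by id, like osebe in the text
vloge_mini = pd.DataFrame({"film": [1, 2], "oseba": [10, 11]})
osebe_mini = pd.DataFrame(
    {"ime": ["Ana", "Boris"]},
    index=pd.Index([10, 11], name="id"),
)

# Match the oseba column on the left with the id index level on the right
joined = pd.merge(vloge_mini, osebe_mini, left_on="oseba", right_on="id")
films = joined["film"].tolist()
names = joined["ime"].tolist()
print(films, names)
```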

Let us see which people have appeared in the most comedies.

zanri_oseb = pd.merge(pd.merge(vloge, zanri), osebe, left_on='oseba', right_on='id')
zanri_oseb[
    (zanri_oseb.zanr == 'Comedy') &
    (zanri_oseb.vloga == 'I')
].groupby(
    'ime'
).size(
).sort_values(
    ascending=False
).head(20)

Exercise

  • Compute the average rating of each genre.

  • Which directors make the most profitable films?