The Pandas library#
Below is an overview of the most basic methods offered by the Pandas library. Each of them has many additional options, which are described in detail in the official documentation. The best place to start reading the documentation is its introductory guides.
Setup#
# load the package
import pandas as pd
# load the table we will be working with
filmi = pd.read_csv('podatki/filmi.csv', index_col='id')
# we will be working with large tables, so display at most 20 rows
pd.options.display.max_rows = 20
Basic selection of table elements#
The .head(n=5) method shows the first n rows of a table, and the .tail(n=5) method shows the last n.
filmi.head(10)
id | naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis |
---|---|---|---|---|---|---|---|---|---|
4972 | The Birth of a Nation | 195 | 1915 | 6.2 | NaN | 24890 | 10000000.0 | TV-PG | The Stoneman family finds its friendship with ... |
6864 | Intolerance | 197 | 1916 | 7.7 | 99.0 | 15670 | 2180000.0 | Passed | The story of a poor young woman separated by p... |
9968 | Broken Blossoms | 90 | 1919 | 7.3 | NaN | 10423 | NaN | Not Rated | A frail waif, abused by her brutal boxer fathe... |
10323 | The Cabinet of Dr. Caligari | 67 | 1920 | 8.0 | NaN | 64133 | NaN | Not Rated | Hypnotist Dr. Caligari uses a somnambulist, Ce... |
12349 | The Kid | 68 | 1921 | 8.3 | NaN | 126513 | 5450000.0 | Passed | The Tramp cares for an abandoned child, but ev... |
12364 | The Phantom Carriage | 107 | 1921 | 8.0 | NaN | 12624 | NaN | Not Rated | On New Year's Eve, the driver of a ghostly car... |
13442 | Nosferatu | 94 | 1922 | 7.9 | NaN | 97589 | NaN | Not Rated | Vampire Count Orlok expresses interest in a ne... |
14341 | Our Hospitality | 65 | 1923 | 7.8 | NaN | 11428 | 1172499.0 | Passed | A man returns to his Appalachian homestead. On... |
14429 | Safety Last! | 74 | 1923 | 8.1 | NaN | 20887 | 1359903.0 | Not Rated | A boy leaves his small country town and heads ... |
15064 | The Last Laugh | 90 | 1924 | 8.0 | NaN | 14150 | 94812.0 | Not Rated | An aging doorman is forced to face the scorn o... |
filmi.tail()
id | naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis |
---|---|---|---|---|---|---|---|---|---|
18568902 | Kaun Pravin Tambe? | 134 | 2022 | 8.4 | NaN | 10163 | NaN | NaN | An indian cricketer who shows persistence and ... |
18689424 | Batman v Superman: Dawn of Justice - Ultimate ... | 182 | 2016 | 7.1 | NaN | 57662 | NaN | R | Batman is manipulated by Lex Luthor to fear Su... |
18968540 | Incantation | 110 | 2022 | 6.2 | NaN | 12366 | NaN | TV-MA | Six years ago, Li Ronan was cursed after break... |
20850406 | Sita Ramam | 163 | 2022 | 8.5 | NaN | 38490 | NaN | NaN | An orphan soldier, Lieutenant Ram's life chang... |
21279138 | Maid in Malacañang | 114 | 2022 | 3.9 | NaN | 15273 | NaN | NaN | The Last Days of Ferdinand and Imelda Marcos t... |
With slices we access selected rows.
filmi[3:10:2]
id | naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis |
---|---|---|---|---|---|---|---|---|---|
10323 | The Cabinet of Dr. Caligari | 67 | 1920 | 8.0 | NaN | 64133 | NaN | Not Rated | Hypnotist Dr. Caligari uses a somnambulist, Ce... |
12364 | The Phantom Carriage | 107 | 1921 | 8.0 | NaN | 12624 | NaN | Not Rated | On New Year's Eve, the driver of a ghostly car... |
14341 | Our Hospitality | 65 | 1923 | 7.8 | NaN | 11428 | 1172499.0 | Passed | A man returns to his Appalachian homestead. On... |
15064 | The Last Laugh | 90 | 1924 | 8.0 | NaN | 14150 | 94812.0 | Not Rated | An aging doorman is forced to face the scorn o... |
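A minimal sketch of how row slicing behaves, using a hypothetical six-row table (filmi_mini and its values are made up for illustration and are not taken from podatki/filmi.csv). A slice start:stop:step counts row positions, not values of the id index.

```python
import pandas as pd

# Hypothetical miniature table standing in for the real data
filmi_mini = pd.DataFrame(
    {
        "naslov": ["A", "B", "C", "D", "E", "F"],
        "ocena": [9.3, 4.5, 7.0, 8.4, 3.9, 8.1],
    },
    index=pd.Index([11, 22, 33, 44, 55, 66], name="id"),
)

# Rows at positions 1, 3 and 5 (the stop position 6 is excluded)
vsaka_druga = filmi_mini[1:6:2]

# .head(2) selects the same rows as the slice [:2]
prvi_dve = filmi_mini.head(2)
enako_kot_rezina = prvi_dve.equals(filmi_mini[:2])

print(vsaka_druga)
```

Here the slice picks the rows with ids 22, 44 and 66, even though none of those numbers appear in the slice itself.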
By indexing a table we access individual columns.
filmi['ocena']
id
4972 6.2
6864 7.7
9968 7.3
10323 8.0
12349 8.3
...
18568902 8.4
18689424 7.1
18968540 6.2
20850406 8.5
21279138 3.9
Name: ocena, Length: 9999, dtype: float64
We access columns often, so a shorter notation is also available.
filmi.ocena
id
4972 6.2
6864 7.7
9968 7.3
10323 8.0
12349 8.3
...
18568902 8.4
18689424 7.1
18968540 6.2
20850406 8.5
21279138 3.9
Name: ocena, Length: 9999, dtype: float64
If we want several columns, we must pass a list of all their labels as the index.
filmi[['naslov', 'ocena']]
id | naslov | ocena |
---|---|---|
4972 | The Birth of a Nation | 6.2 |
6864 | Intolerance | 7.7 |
9968 | Broken Blossoms | 7.3 |
10323 | The Cabinet of Dr. Caligari | 8.0 |
12349 | The Kid | 8.3 |
... | ... | ... |
18568902 | Kaun Pravin Tambe? | 8.4 |
18689424 | Batman v Superman: Dawn of Justice - Ultimate ... | 7.1 |
18968540 | Incantation | 6.2 |
20850406 | Sita Ramam | 8.5 |
21279138 | Maid in Malacañang | 3.9 |
9999 rows × 2 columns
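The difference between indexing with a single label and with a list of labels can be sketched on a small hypothetical table (filmi_mini and posebna are illustrative names, not part of the original data). A single label gives a column (a Series), while a list gives a table (a DataFrame), even when the list has one element.

```python
import pandas as pd

# Hypothetical miniature table
filmi_mini = pd.DataFrame(
    {"naslov": ["A", "B", "C"], "ocena": [9.3, 4.5, 7.0]},
    index=pd.Index([11, 22, 33], name="id"),
)

stolpec = filmi_mini["ocena"]      # a Series
tabela = filmi_mini[["ocena"]]     # a DataFrame with a single column
kratko = filmi_mini.ocena          # same Series as filmi_mini["ocena"]

# The short notation only works when the column name does not collide
# with an existing attribute: here .count is the DataFrame.count method.
posebna = pd.DataFrame({"count": [1, 2]})
je_metoda = callable(posebna.count)
```

For column names that clash with methods or are not valid Python identifiers, the bracket notation is the safe choice.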
We access the entry at position i with .iloc[i], and the entry with key k with .loc[k].
filmi.iloc[120]
naslov Rebecca
dolzina 130
leto 1940
ocena 8.1
metascore 86.0
glasovi 137358
zasluzek 4360000.0
oznaka Approved
opis A self-conscious woman juggles adjusting to he...
Name: 32976, dtype: object
filmi.loc[97576]
naslov Indiana Jones and the Last Crusade
dolzina 127
leto 1989
ocena 8.2
metascore 65.0
glasovi 750654
zasluzek 197171806.0
oznaka PG-13
opis In 1938, after his father Professor Henry Jone...
Name: 97576, dtype: object
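To make the distinction concrete, here is a small sketch on a hypothetical table (filmi_mini is made up for illustration). With .iloc we count positions from 0, while with .loc we look up the label stored in the index.

```python
import pandas as pd

# Hypothetical miniature table indexed by id
filmi_mini = pd.DataFrame(
    {"naslov": ["A", "B", "C"], "ocena": [9.3, 4.5, 7.0]},
    index=pd.Index([11, 22, 33], name="id"),
)

po_polozaju = filmi_mini.iloc[2]   # third row, whatever its id is
po_kljucu = filmi_mini.loc[33]     # row whose id equals 33

# A label that is not in the index raises KeyError
try:
    filmi_mini.loc[999]
    nasli = True
except KeyError:
    nasli = False
```

Both lookups return the same row here, because the row with id 33 happens to sit at position 2.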
Filtering#
We select particular rows of a table by indexing it with a column of Boolean values, which we obtain with ordinary comparison operations. The returned table keeps the rows for which the column holds True.
filmi.ocena >= 8
id
4972 False
6864 False
9968 False
10323 True
12349 True
...
18568902 True
18689424 False
18968540 False
20850406 True
21279138 False
Name: ocena, Length: 9999, dtype: bool
filmi[filmi.ocena >= 9.3]
id | naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis |
---|---|---|---|---|---|---|---|---|---|
111161 | The Shawshank Redemption | 142 | 1994 | 9.3 | 81.0 | 2651625 | 28341469.0 | R | Two imprisoned men bond over a number of years... |
15327088 | Kantara | 148 | 2022 | 9.4 | NaN | 33294 | NaN | NaN | It involves culture of Kambla and Bhootha Kola... |
filmi[(filmi.leto > 2010) & (filmi.ocena > 8) | (filmi.ocena < 5)]
id | naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis |
---|---|---|---|---|---|---|---|---|---|
52077 | Plan 9 from Outer Space | 79 | 1957 | 3.9 | 56.0 | 38744 | NaN | Not Rated | Evil aliens attack Earth and set their terribl... |
54673 | The Beast of Yucca Flats | 54 | 1961 | 1.8 | NaN | 11242 | NaN | Unrated | A defecting Soviet scientist is hit by a nucle... |
58548 | Santa Claus Conquers the Martians | 81 | 1964 | 2.6 | NaN | 11838 | NaN | Not Rated | The Martians kidnap Santa Claus because there ... |
59464 | Monster a Go-Go | 68 | 1965 | 1.7 | NaN | 11138 | NaN | TV-PG | A space capsule crash-lands on Earth, and the ... |
60666 | Manos: The Hands of Fate | 70 | 1966 | 1.6 | NaN | 36445 | NaN | Not Rated | A family gets lost on the road and stumbles up... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
15654262 | Chup | 135 | 2022 | 8.4 | NaN | 13098 | NaN | NaN | A psychopath killer, targeting film critics. T... |
16492678 | Demon Slayer: Kimetsu no Yaiba - Tsuzumi Mansi... | 87 | 2021 | 9.0 | NaN | 12634 | NaN | NaN | Tanjiro ventures to the south-southeast where ... |
18568902 | Kaun Pravin Tambe? | 134 | 2022 | 8.4 | NaN | 10163 | NaN | NaN | An indian cricketer who shows persistence and ... |
20850406 | Sita Ramam | 163 | 2022 | 8.5 | NaN | 38490 | NaN | NaN | An orphan soldier, Lieutenant Ram's life chang... |
21279138 | Maid in Malacañang | 114 | 2022 | 3.9 | NaN | 15273 | NaN | NaN | The Last Days of Ferdinand and Imelda Marcos t... |
695 rows × 9 columns
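When conditions are combined, operator precedence matters: & binds more tightly than |, so an expression of the form A & B | C is read as (A & B) | C. That is why the result above also contains old films with low ratings. The sketch below uses a hypothetical six-row table (filmi_mini is illustrative only) to compare the two possible groupings.

```python
import pandas as pd

# Hypothetical miniature table
filmi_mini = pd.DataFrame(
    {
        "leto": [1994, 1999, 2005, 2012, 2015, 2021],
        "ocena": [9.3, 4.5, 7.0, 8.4, 3.9, 8.1],
    },
    index=pd.Index([11, 22, 33, 44, 55, 66], name="id"),
)

novi = filmi_mini.leto > 2010
dobri = filmi_mini.ocena > 8
slabi = filmi_mini.ocena < 5

# Read as (novi & dobri) | slabi: recent good films, plus all bad films
brez_oklepajev = filmi_mini[(filmi_mini.leto > 2010) & (filmi_mini.ocena > 8) | (filmi_mini.ocena < 5)]

# Explicit grouping: recent films that are either good or bad
z_oklepaji = filmi_mini[novi & (dobri | slabi)]
```

The parentheses around each comparison are required as well, because & and | also bind more tightly than > and <.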
Exercise#
Find the films we want to avoid at all costs, that is, those longer than two hours with a rating below 4. The solution below additionally keeps only films with more than 50,000 votes.
filmi[(filmi.dolzina > 120) & (filmi.ocena < 4) & (filmi.glasovi > 50000)]
id | naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis |
---|---|---|---|---|---|---|---|---|---|
118688 | Batman & Robin | 125 | 1997 | 3.7 | 28.0 | 253972 | 107325195.0 | PG-13 | Batman and Robin try to keep their relationshi... |
120179 | Speed 2: Cruise Control | 121 | 1997 | 3.9 | 23.0 | 81714 | 48608066.0 | PG-13 | A computer hacker breaks into the computer sys... |
2574698 | Gunday | 152 | 2014 | 2.6 | NaN | 59270 | NaN | Not Rated | The lives of Calcutta's most powerful Gunday, ... |
7886848 | Sadak 2 | 133 | 2020 | 1.1 | NaN | 95865 | NaN | TV-MA | The film picks up where Sadak left off, revolv... |
10350922 | Laxmii | 141 | 2020 | 2.6 | NaN | 57411 | NaN | TV-MA | Aasif visits his wife's parents' house and hap... |
10888594 | Radhe | 135 | 2021 | 1.9 | NaN | 177814 | NaN | TV-MA | After taking the dreaded gangster Gani Bhai, A... |
Sorting#
We sort a table with the .sort_values method, passing it the name or a list of names of the columns to sort by. Optionally we can also say which columns should be sorted in ascending and which in descending order.
filmi.sort_values('dolzina')
id | naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis |
---|---|---|---|---|---|---|---|---|---|
2061702 | To the Forest of Firefly Lights | 45 | 2011 | 7.8 | NaN | 18535 | NaN | NaN | Hotaru is rescued by a spirit when she gets lo... |
15324 | Sherlock Jr. | 45 | 1924 | 8.2 | NaN | 50180 | 977375.0 | Passed | A film projectionist longs to be a detective, ... |
2591814 | The Garden of Words | 46 | 2013 | 7.4 | NaN | 44624 | NaN | TV-14 | A 15-year-old boy and 27-year-old woman find a... |
275230 | Blood: The Last Vampire | 48 | 2000 | 6.6 | 44.0 | 12761 | NaN | Not Rated | Saya is a Japanese vampire slayer whose next m... |
142236 | Dragon Ball Z: Revival Fusion | 51 | 1995 | 7.6 | NaN | 11050 | NaN | PG | The universe is thrown into dimensional chaos ... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
107007 | Gettysburg | 271 | 1993 | 7.6 | NaN | 29479 | 10769960.0 | PG | In 1863, the Northern and Southern forces figh... |
74084 | 1900 | 317 | 1976 | 7.7 | 70.0 | 25679 | NaN | Unrated | The epic tale of a class struggle in twentieth... |
1954470 | Gangs of Wasseypur | 321 | 2012 | 8.2 | 89.0 | 96141 | NaN | Not Rated | A clash between Sultan and Shahid Khan leads t... |
346336 | The Best of Youth | 366 | 2003 | 8.5 | 89.0 | 22119 | 274024.0 | R | An Italian epic that follows the lives of two ... |
111341 | Satantango | 439 | 1994 | 8.3 | NaN | 11214 | NaN | Not Rated | On the eve of a large payment, residents of a ... |
9999 rows × 9 columns
# sort by rating in descending order first, then by year in ascending order within each rating
filmi.sort_values(['ocena', 'leto'], ascending=[False, True])
id | naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis |
---|---|---|---|---|---|---|---|---|---|
15327088 | Kantara | 148 | 2022 | 9.4 | NaN | 33294 | NaN | NaN | It involves culture of Kambla and Bhootha Kola... |
111161 | The Shawshank Redemption | 142 | 1994 | 9.3 | 81.0 | 2651625 | 28341469.0 | R | Two imprisoned men bond over a number of years... |
68646 | The Godfather | 175 | 1972 | 9.2 | 100.0 | 1838099 | 134966411.0 | R | The aging patriarch of an organized crime dyna... |
252487 | The Chaos Class | 87 | 1975 | 9.2 | NaN | 40747 | NaN | NaN | Lazy, uneducated students share a very close b... |
50083 | 12 Angry Men | 96 | 1957 | 9.0 | 96.0 | 782923 | 4360000.0 | Approved | The jury in a New York City murder trial is fr... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
421051 | Daniel the Wizard | 81 | 2004 | 1.2 | NaN | 14413 | NaN | Not Rated | Evil assassins want to kill Daniel Kublbock, t... |
6038600 | Smolensk | 120 | 2016 | 1.2 | NaN | 39704 | NaN | NaN | An inspired story of people affected by the tr... |
7886848 | Sadak 2 | 133 | 2020 | 1.1 | NaN | 95865 | NaN | TV-MA | The film picks up where Sadak left off, revolv... |
5988370 | Reis | 108 | 2017 | 1.0 | NaN | 73382 | NaN | NaN | A drama about the early life of Recep Tayyip E... |
7221896 | Cumali Ceber | 100 | 2017 | 1.0 | NaN | 38958 | NaN | NaN | Cumali Ceber goes to a vacation with his child... |
9999 rows × 9 columns
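A small sketch of sorting by two keys, on a hypothetical table with tied ratings (filmi_mini and its values are made up). The second key only decides the order among rows that are equal on the first key, and .sort_values returns a new table rather than changing the original.

```python
import pandas as pd

# Hypothetical miniature table with two pairs of equal ratings
filmi_mini = pd.DataFrame(
    {
        "naslov": ["A", "B", "C", "D"],
        "ocena": [8.0, 9.0, 8.0, 9.0],
        "leto": [2001, 1990, 1995, 1985],
    }
)

# Descending by rating, then ascending by year within each rating
urejeni = filmi_mini.sort_values(["ocena", "leto"], ascending=[False, True])

vrstni_red = list(urejeni.naslov)
print(vrstni_red)
```

The two films rated 9.0 come first, the older one ahead of the newer one, followed by the two films rated 8.0 in the same manner.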
Grouping#
The .groupby method creates a special kind of table in which rows are grouped by a shared property.
filmi_po_letih = filmi.groupby('leto')
filmi_po_letih
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f5ab85bc5b0>
# average rating for each year
filmi_po_letih.ocena.mean()
leto
1915 6.200000
1916 7.700000
1919 7.300000
1920 8.000000
1921 8.150000
...
2018 6.430748
2019 6.493051
2020 6.144304
2021 6.369742
2022 6.361628
Name: ocena, Length: 106, dtype: float64
If we wish, we can also group by computed properties. Let us compute a column and store it in the table.
filmi['desetletje'] = 10 * (filmi.leto // 10)
filmi
id | naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis | desetletje |
---|---|---|---|---|---|---|---|---|---|---|
4972 | The Birth of a Nation | 195 | 1915 | 6.2 | NaN | 24890 | 10000000.0 | TV-PG | The Stoneman family finds its friendship with ... | 1910 |
6864 | Intolerance | 197 | 1916 | 7.7 | 99.0 | 15670 | 2180000.0 | Passed | The story of a poor young woman separated by p... | 1910 |
9968 | Broken Blossoms | 90 | 1919 | 7.3 | NaN | 10423 | NaN | Not Rated | A frail waif, abused by her brutal boxer fathe... | 1910 |
10323 | The Cabinet of Dr. Caligari | 67 | 1920 | 8.0 | NaN | 64133 | NaN | Not Rated | Hypnotist Dr. Caligari uses a somnambulist, Ce... | 1920 |
12349 | The Kid | 68 | 1921 | 8.3 | NaN | 126513 | 5450000.0 | Passed | The Tramp cares for an abandoned child, but ev... | 1920 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18568902 | Kaun Pravin Tambe? | 134 | 2022 | 8.4 | NaN | 10163 | NaN | NaN | An indian cricketer who shows persistence and ... | 2020 |
18689424 | Batman v Superman: Dawn of Justice - Ultimate ... | 182 | 2016 | 7.1 | NaN | 57662 | NaN | R | Batman is manipulated by Lex Luthor to fear Su... | 2010 |
18968540 | Incantation | 110 | 2022 | 6.2 | NaN | 12366 | NaN | TV-MA | Six years ago, Li Ronan was cursed after break... | 2020 |
20850406 | Sita Ramam | 163 | 2022 | 8.5 | NaN | 38490 | NaN | NaN | An orphan soldier, Lieutenant Ram's life chang... | 2020 |
21279138 | Maid in Malacañang | 114 | 2022 | 3.9 | NaN | 15273 | NaN | NaN | The Last Days of Ferdinand and Imelda Marcos t... | 2020 |
9999 rows × 10 columns
filmi_po_desetletjih = filmi.groupby('desetletje')
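Storing the computed column is convenient but not required. As a sketch, .groupby also accepts a Series aligned with the table, so we can group by a computed key without adding it to the table. The table filmi_mini below is hypothetical and only illustrates the idea.

```python
import pandas as pd

# Hypothetical miniature table
filmi_mini = pd.DataFrame(
    {
        "leto": [1994, 1999, 2005, 2012, 2015, 2021],
        "ocena": [9.3, 4.5, 7.0, 8.4, 3.9, 8.1],
    },
    index=pd.Index([11, 22, 33, 44, 55, 66], name="id"),
)

# The key is computed on the fly and given a name for nicer output
desetletje = (10 * (filmi_mini.leto // 10)).rename("desetletje")
povprecja = filmi_mini.groupby(desetletje).ocena.mean()

print(povprecja)
```

The table itself stays unchanged, which is handy when the key is needed only once.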
We count how many films there were in each decade. Most columns show the same number, because each of them has the same number of entries. Where some data is missing, the number is smaller.
filmi_po_desetletjih.count()
desetletje | naslov | dolzina | leto | ocena | metascore | glasovi | zasluzek | oznaka | opis |
---|---|---|---|---|---|---|---|---|---|
1910 | 3 | 3 | 3 | 3 | 1 | 3 | 2 | 3 | 3 |
1920 | 27 | 27 | 27 | 27 | 4 | 27 | 18 | 27 | 27 |
1930 | 80 | 80 | 80 | 80 | 39 | 80 | 36 | 80 | 80 |
1940 | 134 | 134 | 134 | 134 | 63 | 134 | 46 | 133 | 134 |
1950 | 205 | 205 | 205 | 205 | 113 | 205 | 92 | 205 | 205 |
1960 | 284 | 284 | 284 | 284 | 172 | 284 | 150 | 281 | 284 |
1970 | 410 | 410 | 410 | 410 | 323 | 410 | 276 | 394 | 410 |
1980 | 823 | 823 | 823 | 823 | 721 | 823 | 711 | 809 | 823 |
1990 | 1420 | 1420 | 1420 | 1420 | 1128 | 1420 | 1324 | 1399 | 1420 |
2000 | 2575 | 2575 | 2575 | 2575 | 2183 | 2575 | 2203 | 2507 | 2575 |
2010 | 3358 | 3358 | 3358 | 3358 | 2728 | 3358 | 2353 | 3228 | 3358 |
2020 | 680 | 680 | 680 | 680 | 500 | 680 | 47 | 586 | 680 |
If we only want the number of members of each group, we use the .size() method. In this case we get just a column, not a table.
filmi_po_desetletjih.size()
desetletje
1910 3
1920 27
1930 80
1940 134
1950 205
1960 284
1970 410
1980 823
1990 1420
2000 2575
2010 3358
2020 680
dtype: int64
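The difference between .count() and .size() shows up only when values are missing. A minimal sketch on a hypothetical table (the names and numbers are made up):

```python
import pandas as pd

# Hypothetical miniature table with one missing value in zasluzek
filmi_mini = pd.DataFrame(
    {
        "desetletje": [1990, 1990, 2000],
        "naslov": ["A", "B", "C"],
        "zasluzek": [1.0, float("nan"), 2.0],
    }
)

po_desetletjih = filmi_mini.groupby("desetletje")

stevilo_vnosov = po_desetletjih.count()   # non-missing values per column
velikost = po_desetletjih.size()          # rows per group, missing or not

print(stevilo_vnosov)
print(velikost)
```

For the 1990s the zasluzek column counts one value, while the group size is two.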
Let us look at the averages for each decade. We get the mean year, length, rating and earnings. A mean title cannot be computed, so in recent versions of pandas we must pass numeric_only=True to skip the text columns. Without it the call fails with TypeError: agg function failed [how->mean,dtype->object].
filmi_po_desetletjih.mean(numeric_only=True)
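The following sketch shows two ways of restricting a grouped mean to numeric columns, on a hypothetical table (filmi_mini is illustrative only): passing numeric_only=True, or selecting the wanted columns before aggregating.

```python
import pandas as pd

# Hypothetical miniature table with one text column
filmi_mini = pd.DataFrame(
    {
        "naslov": ["A", "B", "C"],
        "desetletje": [1990, 1990, 2000],
        "ocena": [9.3, 4.5, 7.0],
    }
)

po_desetletjih = filmi_mini.groupby("desetletje")

# Option 1: ask pandas to skip non-numeric columns
povprecja = po_desetletjih.mean(numeric_only=True)

# Option 2: choose the columns explicitly before aggregating
povprecja_izbrana = po_desetletjih[["ocena"]].mean()

print(povprecja)
```

Selecting the columns explicitly has the advantage that the result does not change if a new numeric column is later added to the table.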
Exercise#
Compute the number of films of each length, rounded to 5 minutes.
Plotting#
We get an ordinary line plot with the plot method. We use it when we want to show how a value changes with a continuous variable. Our hypothesis is that the golden years of film are over. The plot of the number of films rated above 9 in each decade contradicts it.
filmi[filmi.ocena > 9].groupby('desetletje').size().plot()
We get a scatter plot with the plot.scatter method. We use it when we want to examine the relationship between two variables.
filmi.plot.scatter('ocena', 'metascore')
filmi[filmi.dolzina < 250].plot.scatter('dolzina', 'ocena')
We get a bar chart with the plot.bar method. We use it when we want to compare values of discrete (usually categorical) variables. It is often useful to sort the chart by value.
filmi.sort_values('zasluzek', ascending=False).head(20).plot.bar(x='naslov', y='zasluzek')
Exercise#
Draw plots that suitably show:
The relationship between the IMDb rating and the metascore
How the average film length changes over the years
Joining#
osebe = pd.read_csv('podatki/osebe.csv', index_col='id')
vloge = pd.read_csv('podatki/vloge.csv')
zanri = pd.read_csv('podatki/zanri.csv')
We join tables with the merge function, which returns a table of the entries from both tables for which all columns with the same name match.
vloge[vloge.film == 12349]
zanri[zanri.film == 12349]
pd.merge(vloge, zanri).head(20)
By default the joined table contains only the entries that appear in both tables. This is called an inner join. We can instead choose to keep also the entries that have data only in the left table (left join), only in the right table (right join), or in at least one of the tables (outer join). Where one table has no matching entry, the joined table contains missing values. Because our data comes from IMDb, where both genres and roles are defined for every film, the join types give the same result in our case.
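Since the real data hides the difference between the join types, here is a sketch on two hypothetical tables with deliberately mismatched films (vloge_mini and zanri_mini are made up). The how parameter of merge selects the join type, and indicator=True adds a _merge column that says where each row came from.

```python
import pandas as pd

# Hypothetical tables: film 2 has no genre, film 3 has no roles
vloge_mini = pd.DataFrame({"film": [1, 1, 2], "oseba": [10, 11, 12]})
zanri_mini = pd.DataFrame({"film": [1, 3], "zanr": ["Comedy", "Drama"]})

notranji = pd.merge(vloge_mini, zanri_mini)              # inner join on film
levi = pd.merge(vloge_mini, zanri_mini, how="left")      # keep all roles
zunanji = pd.merge(vloge_mini, zanri_mini, how="outer", indicator=True)

print(zunanji)
```

The inner join keeps only the two rows for film 1, the left join adds film 2 with a missing genre, and the outer join also adds film 3 with a missing person.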
Sometimes we also want to join on columns with different names. In that case we pass the left_on and right_on arguments to the merge function.
pd.merge(pd.merge(vloge, zanri), osebe, left_on='oseba', right_on='id')
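A minimal sketch of joining on differently named columns, using hypothetical tables (vloge_mini and osebe_mini are made up, and unlike the osebe table above, id is kept here as an ordinary column rather than as the index).

```python
import pandas as pd

# Hypothetical tables: roles refer to people through the oseba column
vloge_mini = pd.DataFrame({"film": [1, 1, 2], "oseba": [10, 11, 10]})
osebe_mini = pd.DataFrame({"id": [10, 11], "ime": ["Ana", "Bor"]})

# Match vloge_mini.oseba against osebe_mini.id
staknjeno = pd.merge(vloge_mini, osebe_mini, left_on="oseba", right_on="id")

# Number of roles per person, most frequent first
stevilo_vlog = staknjeno.groupby("ime").size().sort_values(ascending=False)

print(stevilo_vlog)
```

Both key columns, oseba and id, remain in the result, because merge cannot know that they carry the same information under different names.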
Let us see which people have appeared in the most comedies.
zanri_oseb = pd.merge(pd.merge(vloge, zanri), osebe, left_on='oseba', right_on='id')
zanri_oseb[
(zanri_oseb.zanr == 'Comedy') &
(zanri_oseb.vloga == 'I')
].groupby(
'ime'
).size(
).sort_values(
ascending=False
).head(20)
Exercise#
Compute the average rating of each genre.
Which directors make the most profitable films?