Naive Bayes classifier#
We would like to know whether a film's genres can be predicted from its description. This is a classification problem, since we want to sort films into genres, and our task is to write a program that does this, which we call a classifier.
Preparation#
# load the package
import pandas as pd
# load the tables we will work with
filmi = pd.read_csv('podatki/filmi.csv', index_col='id')
osebe = pd.read_csv('podatki/osebe.csv', index_col='id')
vloge = pd.read_csv('podatki/vloge.csv')
zanri = pd.read_csv('podatki/zanri.csv')
Stemming words#
To make the problem more manageable, we will represent a description only by the set of roots of the words that appear in it.
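def koren_besede(beseda):
    # keep only letters; a word with no letters at all (e.g. a number) becomes '$'
    beseda = ''.join(znak for znak in beseda if znak.isalpha())
    if not beseda:
        return '$'
    konec = len(beseda) - 1
    # drop one final 'd' or 's', then any trailing vowels
    if beseda[konec] in 'ds':
        konec -= 1
    while konec >= 0 and beseda[konec] in 'aeiou':
        konec -= 1
    return beseda[:konec + 1]


def koreni_besed(niz):
    # sorted Series of the distinct roots that occur in the string
    return pd.Series(sorted({
        koren_besede(beseda) for beseda in niz.replace('-', ' ').lower().split() if beseda
    }))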
koreni_besed("In 1938, after his father Professor Henry Jones, Sr. goes missing while pursuing the Holy Grail, Indiana Jones finds himself up against Adolf Hitler's Nazis again to stop them obtaining its powers.")
0 $
1 adolf
2 after
3 again
4 against
5 father
6 find
7 g
8 grail
9 h
10 henry
11 himself
12 hitler
13 holy
14 in
15 indian
16 it
17 jon
18 missing
19 naz
20 obtaining
21 power
22 professor
23 pursuing
24 sr
25 stop
26 t
27 th
28 them
29 up
30 whil
dtype: object
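To see where short roots such as `g`, `h` or `$` in the output above come from, it can help to run the stemmer on single words. The snippet below is a small sketch: it repeats `koren_besede` so that it runs on its own, and the word list `primeri` is a hypothetical sample taken from the sentence above (already lowercased, as `koreni_besed` would do).

```python
def koren_besede(beseda):
    # Same crude stemmer as in the text, repeated so the snippet is self-contained.
    beseda = ''.join(znak for znak in beseda if znak.isalpha())
    if not beseda:
        return '$'
    konec = len(beseda) - 1
    if beseda[konec] in 'ds':
        konec -= 1
    while konec >= 0 and beseda[konec] in 'aeiou':
        konec -= 1
    return beseda[:konec + 1]


# Hypothetical sample of lowercased words from the example sentence.
primeri = ['1938,', 'jones,', 'goes', 'his', 'a', 'powers.']
koreni = {beseda: koren_besede(beseda) for beseda in primeri}

for beseda, koren in koreni.items():
    print(repr(beseda), '->', repr(koren))
```

A token without letters (`'1938,'`) maps to `'$'`, `'goes'` shrinks to `'g'`, and a word made up only of vowels (`'a'`) maps to the empty string, so very different words can end up sharing a root.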
Bayes' theorem#
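We are thus interested in the probability that a film has genre \(Ž_i\) given that its description contains the roots \(K_1, \ldots, K_m\), that is,
\[ P(Ž_i | K_1 \cap \cdots \cap K_m). \]
To compute it we use Bayes' theorem
\[ P(A | B) = \frac{P(B | A) \, P(A)}{P(B)}, \]
which is why our classifier is called a Bayes classifier. It gives
\[ P(Ž_i | K_1 \cap \cdots \cap K_m) = \frac{P(K_1 \cap \cdots \cap K_m | Ž_i) \, P(Ž_i)}{P(K_1 \cap \cdots \cap K_m)}. \]
We simplify the task further by assuming that the occurrences of words are independent of one another. This is not actually true: next to the word treasure, for example, the word hidden appears more often than the word boring. This is why the classifier is called naive. Under this assumption we have
\[ P(K_1 \cap \cdots \cap K_m) = P(K_1) \cdots P(K_m) \quad \text{and} \quad P(K_1 \cap \cdots \cap K_m | Ž_i) = P(K_1 | Ž_i) \cdots P(K_m | Ž_i), \]
and therefore
\[ P(Ž_i | K_1 \cap \cdots \cap K_m) = \frac{P(K_1 | Ž_i) \cdots P(K_m | Ž_i) \, P(Ž_i)}{P(K_1) \cdots P(K_m)}. \]
To a film whose description contains the roots \(K_1, \ldots, K_m\) we assign those genres \(Ž_i\) for which this probability is largest. Since the denominator does not depend on the genre, for each \(Ž_i\) we only need to compute the numerator:
\[ P(K_1 | Ž_i) \cdots P(K_m | Ž_i) \, P(Ž_i). \]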
We know how to compute all of these quantities, so we can get to work.
The probability of an individual genre \(P(Ž)\) is easy to compute:
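Before turning to the real data, the numerator of the naive Bayes formula can be tried out on a handful of made-up numbers. Everything in the sketch below is hypothetical (the two genres, the probabilities, and the names `p_zanra`, `p_korena`, `koreni_opisa`); it only illustrates that comparing genres requires nothing more than the genre probability multiplied by the conditional probabilities of the roots.

```python
import math

# Hypothetical numbers, chosen only to illustrate the formula.
p_zanra = {'Sci-Fi': 0.07, 'Romance': 0.17}          # P(Ž)
p_korena = {                                          # P(K | Ž)
    'Sci-Fi': {'alien': 0.05, 'lov': 0.02},
    'Romance': {'alien': 0.001, 'lov': 0.21},
}
koreni_opisa = ['alien', 'lov']                       # roots found in a description

# Numerator for each genre: P(K_1 | Ž) * ... * P(K_m | Ž) * P(Ž)
stevci = {
    zanr: p_zanra[zanr] * math.prod(p_korena[zanr][koren] for koren in koreni_opisa)
    for zanr in p_zanra
}
najboljsi = max(stevci, key=stevci.get)
print(stevci, najboljsi)
```

With these numbers Sci-Fi wins even though Romance is the more common genre, because the root `alien` is far less likely in a Romance description.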
verjetnosti_zanrov = zanri.groupby('zanr').size() / len(filmi)
verjetnosti_zanrov.sort_values()
zanr
Reality-TV 0.000100
Film-Noir 0.005801
Western 0.011801
Musical 0.013501
Sport 0.020102
War 0.022602
Music 0.027603
History 0.034103
Animation 0.047005
Family 0.047005
Biography 0.067107
Sci-Fi 0.072607
Fantasy 0.074507
Mystery 0.103810
Horror 0.126713
Adventure 0.171017
Thriller 0.171217
Romance 0.173217
Crime 0.206121
Action 0.250625
Comedy 0.366237
Drama 0.568057
dtype: float64
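The listed probabilities add up to more than 1, which is expected because a film usually has several genres. The sketch below reproduces the same computation on a tiny table; `mali_zanri` is a hypothetical stand-in for `zanri`, and the column name `film` is an assumption (only `zanr` is visible in the code above).

```python
import pandas as pd

# Hypothetical miniature of the genre table: one row per (film, genre) pair.
mali_zanri = pd.DataFrame({
    'film': [1, 1, 2, 3, 3, 3],
    'zanr': ['Drama', 'War', 'Drama', 'Comedy', 'Drama', 'Romance'],
})
stevilo_filmov = mali_zanri['film'].nunique()  # three distinct films

# Share of films that carry each genre.
male_verjetnosti = mali_zanri.groupby('zanr').size() / stevilo_filmov
print(male_verjetnosti.sort_values())
print(male_verjetnosti.sum())  # exceeds 1, since one film can carry several genres
```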
We will store the probabilities \(P(K|Ž)\) in a table whose rows correspond to roots \(K\) and whose columns correspond to genres \(Ž\). First we need to find all films that have genre \(Ž\) and whose description contains the root \(K\). Let us take all the film descriptions:
filmi.opis
id
4972 The Stoneman family finds its friendship with ...
6864 The story of a poor young woman separated by p...
9968 A frail waif, abused by her brutal boxer fathe...
10323 Hypnotist Dr. Caligari uses a somnambulist, Ce...
12349 The Tramp cares for an abandoned child, but ev...
...
18568902 An indian cricketer who shows persistence and ...
18689424 Batman is manipulated by Lex Luthor to fear Su...
18968540 Six years ago, Li Ronan was cursed after break...
20850406 An orphan soldier, Lieutenant Ram's life chang...
21279138 The Last Days of Ferdinand and Imelda Marcos t...
Name: opis, Length: 9999, dtype: object
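We convert this series of strings into the sets of roots of their words. We use the apply method, which applies the given function to each entry. Because koreni_besed returns a Series, the result is a table with one row per film.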
filmi.opis.apply(
    koreni_besed
)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
4972 | affect | an | arm | assassination | birth | both | by | cameron | civil | development | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6864 | | an | baby | by | from | her | history | husban | interwoven | intoleranc | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9968 | | abus | befriend | boxer | brutal | by | chines | consequenc | district | father | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10323 | | caligar | cesar | commit | dr | hypnotist | murder | somnambulist | t | us | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12349 | abandon | an | but | car | chil | event | for | in | jeopardy | put | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18568902 | | achiev | ag | an | at | believ | career | chapter | cricketer | end | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
18689424 | | an | batman | by | clash | cris | dividing | during | existenc | fear | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
18968540 | | action | after | ag | breaking | consequenc | curs | daughter | from | her | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20850406 | | after | an | back | between | blossom | camp | caught | chang | com | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21279138 | $ | an | day | dynasty | ferdinan | hour | imeld | last | marc | of | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9999 rows × 55 columns
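The shape of this table follows from how `Series.apply` treats a function that itself returns a Series: each returned Series becomes one row, and shorter rows are padded with missing values. A minimal sketch of that behaviour, with the hypothetical names `opisi` and `besede` (a simplified stand-in for `koreni_besed` that skips the stemming):

```python
import pandas as pd

# Two hypothetical descriptions, indexed by film id.
opisi = pd.Series(
    ['space ship', 'young woman finds love'],
    index=pd.Index([10, 20], name='id'),
)


def besede(niz):
    # Stand-in for koreni_besed: sorted distinct words, without stemming.
    return pd.Series(sorted(set(niz.split())))


# One row per description, one column per word position.
tabela = opisi.apply(besede)
print(tabela)
```

The first description has only two words, so its last two cells are missing, just like the `NaN` cells in the table above.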
After some searching online and a bit of data wrangling, we arrive at the table we are after:
koreni_filmov = filmi.opis.apply(
    koreni_besed
).stack(
).reset_index(
    level='id'
).rename(columns={
    'id': 'film',
    0: 'koren',
})
koreni_filmov
| film | koren |
---|---|---|
0 | 4972 | affect |
1 | 4972 | an |
2 | 4972 | arm |
3 | 4972 | assassination |
4 | 4972 | birth |
... | ... | ... |
9 | 21279138 | of |
10 | 21279138 | story |
11 | 21279138 | tell |
12 | 21279138 | th |
13 | 21279138 | untol |
227443 rows × 2 columns
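The chain of `stack`, `reset_index` and `rename` turns the wide table (one row per film) into a long one (one row per film and root). The sketch below replays it on a hypothetical two-film table called `siroka`. The explicit `dropna()` is an addition that is not in the original chain: it is there as a precaution, under the assumption that some pandas versions keep missing cells when stacking.

```python
import pandas as pd

# Hypothetical wide table: one row per film, one column per root position.
siroka = pd.DataFrame(
    [['ship', 'space', None], ['find', 'lov', 'young']],
    index=pd.Index([10, 20], name='id'),
)

dolga = (
    siroka.stack()               # one entry per (id, column position) pair
    .dropna()                    # precaution: discard missing cells if they were kept
    .reset_index(level='id')     # move the film id from the index into a column
    .rename(columns={'id': 'film', 0: 'koren'})
)
print(dolga)
```

The remaining index holds the old column positions, which is why the index in the output above restarts for every film.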
We merge this table with the table of genres to obtain a table of roots by genre.
koreni_zanrov = pd.merge(
    koreni_filmov,
    zanri
)[['koren', 'zanr']]
koreni_zanrov
| koren | zanr |
---|---|---|
0 | affect | Drama |
1 | affect | History |
2 | affect | War |
3 | an | Drama |
4 | an | History |
... | ... | ... |
588190 | of | Drama |
588191 | story | Drama |
588192 | tell | Drama |
588193 | th | Drama |
588194 | untol | Drama |
588195 rows × 2 columns
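The merged table is larger than `koreni_filmov` because every root of a film is paired with every genre of that film. A small sketch with hypothetical tables `koreni` and `zanri_mali` shows the effect; it assumes, as the merge above suggests, that the two tables share a `film` column.

```python
import pandas as pd

# Hypothetical miniatures of koreni_filmov and zanri.
koreni = pd.DataFrame({'film': [10, 10, 20], 'koren': ['ship', 'space', 'lov']})
zanri_mali = pd.DataFrame({'film': [10, 10, 20], 'zanr': ['Action', 'Sci-Fi', 'Romance']})

# Without an `on=` argument, pd.merge joins on the columns both tables share (here 'film').
zdruzeno = pd.merge(koreni, zanri_mali)[['koren', 'zanr']]
print(zdruzeno)
```

Film 10 has two roots and two genres and therefore contributes four rows, while film 20 contributes one.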
Using the crosstab function, we count how many times each combination occurs.
pojavitve_korenov_po_zanrih = pd.crosstab(koreni_zanrov.koren, koreni_zanrov.zanr)
pojavitve_korenov_po_zanrih
zanr | Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | Film-Noir | ... | Music | Musical | Mystery | Reality-TV | Romance | Sci-Fi | Sport | Thriller | War | Western |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
koren | |||||||||||||||||||||
| 2270 | 1510 | 412 | 530 | 3239 | 1888 | 5086 | 400 | 669 | 55 | ... | 242 | 117 | 982 | 1 | 1536 | 669 | 170 | 1612 | 204 | 102 |
$ | 233 | 144 | 27 | 159 | 326 | 184 | 676 | 38 | 56 | 2 | ... | 26 | 10 | 89 | 1 | 161 | 85 | 29 | 155 | 46 | 14 |
aang | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
aaron | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
aart | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ángel | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
æon | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
çanakkal | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
édith | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
émilien | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15528 rows × 22 columns
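What `pd.crosstab` does here is easiest to see on a few rows. In the sketch below, `pari` is a hypothetical miniature of `koreni_zanrov`; each row is one (root, genre) pair and the result counts how often each pair occurs.

```python
import pandas as pd

# Hypothetical (root, genre) pairs.
pari = pd.DataFrame({
    'koren': ['ship', 'ship', 'lov', 'lov', 'lov'],
    'zanr': ['Sci-Fi', 'Action', 'Romance', 'Romance', 'Drama'],
})

# Rows are roots, columns are genres, cells are pair counts (0 if the pair never occurs).
stetje = pd.crosstab(pari.koren, pari.zanr)
print(stetje)
```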
We now obtain the desired probabilities by dividing each column by the number of films of that genre. To avoid a zero probability for roots that do not occur in our sample for some genre, we increase every probability slightly.
verjetnosti_korenov_po_zanrih = pojavitve_korenov_po_zanrih / zanri.groupby('zanr').size() + 0.001
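The division above relies on pandas aligning the Series of film counts with the column labels of the table. The following sketch checks this on hypothetical counts (`pojavitve`, `filmov_na_zanr`) and shows the effect of the added constant.

```python
import pandas as pd

# Hypothetical counts: in how many films of each genre a root occurs.
pojavitve = pd.DataFrame(
    {'Romance': [2, 0], 'Sci-Fi': [0, 3]},
    index=pd.Index(['lov', 'ship'], name='koren'),
)
# Hypothetical number of films per genre.
filmov_na_zanr = pd.Series({'Romance': 4, 'Sci-Fi': 6})

# DataFrame / Series matches the Series index against the column labels,
# so every column is divided by the film count of its own genre.
verjetnosti = pojavitve / filmov_na_zanr + 0.001
print(verjetnosti)
```

A root that never occurs in a genre ends up with the value 0.001 instead of 0, so a single unseen root no longer wipes out the whole product. The shifted values are no longer exact probabilities, which does not matter here because only the ranking of the genres is used.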
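Let us look at the most frequent roots for a few genres. The unlabelled first row in each listing is the empty root, which koren_besede returns for words that reduce to nothing, such as a or is.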
verjetnosti_korenov_po_zanrih.Crime.sort_values(ascending=False).head(20)
koren
0.917060
th 0.664270
an 0.639525
t 0.599253
of 0.534721
h 0.463882
in 0.419244
on 0.245056
with 0.230015
for 0.184891
wh 0.177128
by 0.152868
when 0.146560
from 0.132490
after 0.129093
their 0.122786
murder 0.116963
her 0.115508
that 0.103863
him 0.094644
Name: Crime, dtype: float64
verjetnosti_korenov_po_zanrih.Romance.sort_values(ascending=False).head(20)
koren
0.887836
an 0.645919
th 0.633217
t 0.574903
of 0.466935
in 0.452501
h 0.434025
with 0.326058
her 0.279291
on 0.223864
for 0.212894
lov 0.210007
wh 0.164972
their 0.157467
young 0.156889
woman 0.153425
when 0.141300
lif 0.134372
that 0.133794
man 0.126866
Name: Romance, dtype: float64
verjetnosti_korenov_po_zanrih['Sci-Fi'].sort_values(ascending=False).head(20)
koren
0.922488
th 0.707612
an 0.662157
t 0.642873
of 0.550587
in 0.434884
h 0.356372
on 0.264085
with 0.215876
for 0.170421
from 0.169044
that 0.164912
by 0.160780
their 0.134609
after 0.134609
earth 0.124967
when 0.123590
$ 0.118080
int 0.116702
wh 0.113948
Name: Sci-Fi, dtype: float64
We now determine the genres as follows: for each genre, we multiply the probability of the genre by the conditional probabilities of all roots that occur in the film's description.
def doloci_zanre(opis):
    # P(Ž) times the product of P(K|Ž) over the known roots of the description
    faktorji_zanrov = verjetnosti_zanrov * verjetnosti_korenov_po_zanrih[
        verjetnosti_korenov_po_zanrih.index.isin(
            koreni_besed(opis)
        )
    ].prod()
    # scale so that the most likely genre gets factor 1
    faktorji_zanrov /= max(faktorji_zanrov)
    return faktorji_zanrov.sort_values(ascending=False).head(5)
doloci_zanre('Alien space ship appears above Slovenia.')
zanr
Sci-Fi 1.000000
Adventure 0.990078
Action 0.454610
Animation 0.335296
Horror 0.122791
dtype: float64
doloci_zanre('A story about a young mathematician, who discovers her artistic side')
zanr
Drama 1.000000
Biography 0.824923
Romance 0.375775
Musical 0.118636
Comedy 0.099659
dtype: float64
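The selection and multiplication inside `doloci_zanre` can be followed step by step on a tiny table. The sketch below uses hypothetical probabilities and the hypothetical names `verjetnosti_zanrov_male`, `verjetnosti_korenov_male` and `doloci_zanre_malo`; it takes a list of roots directly instead of a description, so that it runs without the stemmer.

```python
import pandas as pd

# Hypothetical P(Ž) and P(K | Ž) for two genres and three roots.
verjetnosti_zanrov_male = pd.Series({'Romance': 0.4, 'Sci-Fi': 0.6})
verjetnosti_korenov_male = pd.DataFrame(
    {'Romance': [0.501, 0.001, 0.301], 'Sci-Fi': [0.001, 0.501, 0.201]},
    index=pd.Index(['lov', 'ship', 'young'], name='koren'),
)


def doloci_zanre_malo(koreni_opisa):
    # Keep only the rows of roots that occur in the description.
    izbrane = verjetnosti_korenov_male[
        verjetnosti_korenov_male.index.isin(koreni_opisa)
    ]
    # Column-wise product gives one factor per genre; multiply by P(Ž).
    faktorji = verjetnosti_zanrov_male * izbrane.prod()
    # Scale so that the best genre has factor 1.
    return (faktorji / faktorji.max()).sort_values(ascending=False)


rezultat = doloci_zanre_malo(['ship', 'young', 'neznan'])
print(rezultat)
```

The root `neznan` is not in the table, so `isin` simply skips it. The same holds for `doloci_zanre`: roots that never appeared in the training descriptions have no influence on the result.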