Naive Bayes classifier

We want to know whether a film's genres can be predicted from its description. This is a classification problem, since we want to classify films into genres. Our task is to write a program that does this; such a program is called a classifier.

Preparation

# load the package
import pandas as pd

# load the tables we will work with
filmi = pd.read_csv('podatki/filmi.csv', index_col='id')
osebe = pd.read_csv('podatki/osebe.csv', index_col='id')
vloge = pd.read_csv('podatki/vloge.csv')
zanri = pd.read_csv('podatki/zanri.csv')

Word stemming

To make the problem more manageable, we represent each description only by the set of roots of the words that appear in it.

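def koren_besede(beseda):
    # keep only the letters; a word without any letters becomes '$'
    beseda = ''.join(znak for znak in beseda if znak.isalpha())
    if not beseda:
        return '$'
    konec = len(beseda) - 1
    # drop a single trailing 'd' or 's' ...
    if beseda[konec] in 'ds':
        konec -= 1
    # ... and then all trailing vowels
    while konec >= 0 and beseda[konec] in 'aeiou':
        konec -= 1
    return beseda[:konec + 1]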

def koreni_besed(niz):
    return pd.Series(sorted({
        koren_besede(beseda) for beseda in niz.replace('-', ' ').lower().split() if beseda
    }))
koreni_besed("In 1938, after his father Professor Henry Jones, Sr. goes missing while pursuing the Holy Grail, Indiana Jones finds himself up against Adolf Hitler's Nazis again to stop them obtaining its powers.")
0             $
1         adolf
2         after
3         again
4       against
5        father
6          find
7             g
8         grail
9             h
10        henry
11      himself
12       hitler
13         holy
14           in
15       indian
16           it
17          jon
18      missing
19          naz
20    obtaining
21        power
22    professor
23     pursuing
24           sr
25         stop
26            t
27           th
28         them
29           up
30         whil
dtype: object
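To see where roots such as g, h and th in the output above come from, it can help to run the stemming rule on a few single words. The sketch below repeats koren_besede so that it runs on its own; the list primeri is a hypothetical set of example words chosen for illustration.

```python
def koren_besede(beseda):
    # same rule as above: letters only, drop one trailing 'd' or 's',
    # then drop all trailing vowels
    beseda = ''.join(znak for znak in beseda if znak.isalpha())
    if not beseda:
        return '$'
    konec = len(beseda) - 1
    if beseda[konec] in 'ds':
        konec -= 1
    while konec >= 0 and beseda[konec] in 'aeiou':
        konec -= 1
    return beseda[:konec + 1]

# hypothetical example words
primeri = ['jones,', 'goes', 'his', 'the', 'and', 'a', '1938,']
koreni = {beseda: koren_besede(beseda) for beseda in primeri}
for beseda, koren in koreni.items():
    print(f'{beseda!r:10} -> {koren!r}')
```

The rule is deliberately crude: goes collapses to g, and a word made up only of vowels, such as a, collapses to the empty string.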

Bayes' theorem

We are therefore interested in the probability that a film has genre \(Ž_i\), given that its description contains the roots \(K_1, \ldots, K_n\), that is,

\[P(Ž_i | K_1 \cap \cdots \cap K_n)\]

To compute it we use Bayes' theorem

\[P(A | B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B | A) \cdot P(A)}{P(B)}\]

which is why our classifier is called a Bayes classifier. We have

\[P(Ž_i | K_1 \cap \cdots \cap K_n) = \frac{P(K_1 \cap \cdots \cap K_n | Ž_i) \cdot P(Ž_i)}{P(K_1 \cap \cdots \cap K_n)}\]

We simplify the task further by assuming that, within a given genre, the occurrences of words are independent of one another. This is not actually true: next to the word treasure, for example, the word hidden appears more often than, say, boring. That is why the classifier is called naive. Under this assumption we have:

\[P(K_1 \cap \cdots \cap K_n | Ž_i) = P(K_1 | Ž_i) \cdot \cdots \cdot P(K_n | Ž_i)\]

or equivalently

\[P(Ž_i | K_1 \cap \cdots \cap K_n) = \frac{P(K_1 | Ž_i) \cdot \cdots \cdot P(K_n | Ž_i) \cdot P(Ž_i)}{P(K_1 \cap \cdots \cap K_n)}\]

To a film whose description contains the roots \(K_1, \dots, K_n\) we assign those genres \(Ž_i\) for which this probability is largest. Since the denominator does not depend on the genre, for each \(Ž_i\) we only need to compute the numerator:

\[P(K_1 | Ž_i) \cdot \cdots \cdot P(K_n | Ž_i) \cdot P(Ž_i)\]
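As a small illustration of this numerator, the sketch below evaluates it for two genres and a two-root description. All numbers are hypothetical and are not taken from the film data; the names p_zanra, p_korena and koreni_opisa are introduced here only for the example.

```python
import math

# hypothetical toy probabilities, not computed from the film data
p_zanra = {'Sci-Fi': 0.07, 'Drama': 0.57}
p_korena = {
    'Sci-Fi': {'alien': 0.05, 'spac': 0.08},
    'Drama': {'alien': 0.002, 'spac': 0.004},
}
koreni_opisa = ['alien', 'spac']

# numerator: P(K_1 | Ž) * ... * P(K_n | Ž) * P(Ž) for every genre Ž
stevci = {
    zanr: p_zanra[zanr] * math.prod(p_korena[zanr][k] for k in koreni_opisa)
    for zanr in p_zanra
}

# dividing every numerator by the same positive constant keeps the order
konstanta = 0.001
deljeni = {zanr: stevec / konstanta for zanr, stevec in stevci.items()}

najboljsi = max(stevci, key=stevci.get)
print(stevci, najboljsi)
```

Although Drama is far more common in this toy setup, the two roots are so much more typical of Sci-Fi that its numerator comes out larger.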

We know how to compute all of these quantities, so we can get to work.

The probability \(P(Ž)\) of an individual genre is easy to compute, as the share of all films that have that genre:

verjetnosti_zanrov = zanri.groupby('zanr').size() / len(filmi)
verjetnosti_zanrov.sort_values()
zanr
Reality-TV    0.000100
Film-Noir     0.005801
Western       0.011801
Musical       0.013501
Sport         0.020102
War           0.022602
Music         0.027603
History       0.034103
Family        0.047005
Animation     0.047005
Biography     0.067107
Sci-Fi        0.072607
Fantasy       0.074507
Mystery       0.103810
Horror        0.126713
Adventure     0.171017
Thriller      0.171217
Romance       0.173217
Crime         0.206121
Action        0.250625
Comedy        0.366237
Drama         0.568057
dtype: float64
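The values above add up to more than 1 because a film can have several genres. A minimal sketch on hypothetical toy data shows the same computation on a scale that is easy to check by hand; the tables filmi_primer and zanri_primer are made up, and the column names film and zanr are assumed to mirror those of zanri.

```python
import pandas as pd

# hypothetical toy data: 4 films, some of them with more than one genre
filmi_primer = pd.DataFrame(
    {'naslov': ['A', 'B', 'C', 'D']},
    index=pd.Index([1, 2, 3, 4], name='id'),
)
zanri_primer = pd.DataFrame({
    'film': [1, 1, 2, 3, 3, 4],
    'zanr': ['Drama', 'War', 'Drama', 'Comedy', 'Drama', 'Comedy'],
})

# number of films per genre divided by the total number of films
verjetnosti_primer = zanri_primer.groupby('zanr').size() / len(filmi_primer)
print(verjetnosti_primer)
print(verjetnosti_primer.sum())
```

Three of the four films are dramas, so Drama gets 0.75, and the three shares sum to 1.5.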

We will store the probabilities \(P(K|Ž)\) in a table whose rows correspond to the roots \(K\) and whose columns correspond to the genres \(Ž\). First we have to find all films that have genre \(Ž\) and whose description contains the root \(K\). Let us take the descriptions of all films:

filmi.opis
id
4972        The Stoneman family finds its friendship with ...
6864        The story of a poor young woman separated by p...
9968        A frail waif, abused by her brutal boxer fathe...
10323       Hypnotist Dr. Caligari uses a somnambulist, Ce...
12349       The Tramp cares for an abandoned child, but ev...
                                  ...                        
18568902    An indian cricketer who shows persistence and ...
18689424    Batman is manipulated by Lex Luthor to fear Su...
18968540    Six years ago, Li Ronan was cursed after break...
20850406    An orphan soldier, Lieutenant Ram's life chang...
21279138    The Last Days of Ferdinand and Imelda Marcos t...
Name: opis, Length: 9999, dtype: object

We convert this series of strings into a series of sets of roots. We use the apply method, which applies the given function to every entry.

filmi.opis.apply(
    koreni_besed
)
/tmp/ipykernel_2057/601056397.py:1: FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.
  filmi.opis.apply(
0 1 2 3 4 5 6 7 8 9 ... 45 46 47 48 49 50 51 52 53 54
id
4972 affect an arm assassination birth both by cameron civil development ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6864 an baby by from her history husban interwoven intoleranc ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9968 abus befriend boxer brutal by chines consequenc district father ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10323 caligar cesar commit dr hypnotist murder somnambulist t us ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12349 abandon an but car chil event for in jeopardy put ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18568902 achiev ag an at believ career chapter cricketer end ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18689424 an batman by clash cris dividing during existenc fear ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18968540 action after ag breaking consequenc curs daughter from her ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20850406 after an back between blossom camp caught chang com ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
21279138 $ an day dynasty ferdinan hour imeld last marc of ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

9999 rows × 55 columns

After some searching online and some data wrangling, we arrive at the table we want:

koreni_filmov = filmi.opis.apply(
    koreni_besed
).stack(
).reset_index(
    level='id'
).rename(columns={
    'id': 'film',
    0: 'koren',
})
koreni_filmov
/tmp/ipykernel_2057/488542747.py:1: FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.
  koreni_filmov = filmi.opis.apply(
        film          koren
0       4972         affect
1       4972             an
2       4972            arm
3       4972  assassination
4       4972          birth
...      ...            ...
9   21279138             of
10  21279138          story
11  21279138           tell
12  21279138             th
13  21279138          untol

227443 rows × 2 columns
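The FutureWarning appears because koreni_besed returns a pd.Series, which apply expands into a wide table. One possible way around it, shown here only as a sketch on two made-up descriptions, is to return a plain list and use explode. The helper seznam_korenov and the series opisi_primer are hypothetical names, not part of the original notebook.

```python
import pandas as pd

def koren_besede(beseda):
    # same stemming rule as earlier, repeated so the sketch runs on its own
    beseda = ''.join(znak for znak in beseda if znak.isalpha())
    if not beseda:
        return '$'
    konec = len(beseda) - 1
    if beseda[konec] in 'ds':
        konec -= 1
    while konec >= 0 and beseda[konec] in 'aeiou':
        konec -= 1
    return beseda[:konec + 1]

def seznam_korenov(niz):
    # hypothetical helper: a sorted list of roots instead of a pd.Series
    return sorted({koren_besede(b) for b in niz.replace('-', ' ').lower().split()})

# two made-up descriptions indexed by film id
opisi_primer = pd.Series(
    ['The Tramp cares for a child', 'Batman fears Superman'],
    index=pd.Index([12349, 18689424], name='id'),
    name='opis',
)

koreni_filmov_primer = (
    opisi_primer.apply(seznam_korenov)  # series of lists, one list per film
    .explode()                          # one row per (film, root) pair
    .rename('koren')
    .reset_index()
    .rename(columns={'id': 'film'})
)
print(koreni_filmov_primer)
```

The result has the same two columns, film and koren. The only difference is the row index, which here simply counts rows from 0.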

We merge this table with the table of genres to obtain a table that pairs each root with each genre of the film it appears in.

koreni_zanrov = pd.merge(
    koreni_filmov,
    zanri
)[['koren', 'zanr']]
koreni_zanrov
        koren     zanr
0      affect    Drama
1      affect  History
2      affect      War
3          an    Drama
4          an  History
...       ...      ...
588190     of    Drama
588191  story    Drama
588192   tell    Drama
588193     th    Drama
588194  untol    Drama

588195 rows × 2 columns

Using the crosstab function, we count how many times each combination of root and genre occurs.

pojavitve_korenov_po_zanrih = pd.crosstab(koreni_zanrov.koren, koreni_zanrov.zanr)
pojavitve_korenov_po_zanrih
zanr Action Adventure Animation Biography Comedy Crime Drama Family Fantasy Film-Noir ... Music Musical Mystery Reality-TV Romance Sci-Fi Sport Thriller War Western
koren
2270 1510 412 530 3239 1888 5086 400 669 55 ... 242 117 982 1 1536 669 170 1612 204 102
$ 233 144 27 159 326 184 676 38 56 2 ... 26 10 89 1 161 85 29 155 46 14
aang 1 1 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
aaron 1 1 0 0 2 1 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
aart 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ángel 0 0 0 0 0 1 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
æon 1 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
çanakkal 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
édith 0 0 0 1 0 0 1 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
émilien 1 0 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

15528 rows × 22 columns
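What crosstab does is easiest to see on a table small enough to count by hand. The sketch below uses a hypothetical six-row table koreni_zanrov_primer with the same two columns as koreni_zanrov.

```python
import pandas as pd

# hypothetical toy table of (root, genre) pairs
koreni_zanrov_primer = pd.DataFrame({
    'koren': ['murder', 'murder', 'lov', 'lov', 'lov', 'th'],
    'zanr': ['Crime', 'Crime', 'Romance', 'Romance', 'Crime', 'Romance'],
})

# rows are roots, columns are genres, cells count how often each pair occurs
tabela_primer = pd.crosstab(koreni_zanrov_primer.koren, koreni_zanrov_primer.zanr)
print(tabela_primer)
```

Pairs that never occur, such as murder with Romance, get a count of 0 rather than being left out.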

We now obtain the desired probabilities by dividing each column by the number of films of that genre. To avoid a zero probability for roots that never occur with a given genre in our sample, we increase every probability slightly.

verjetnosti_korenov_po_zanrih = pojavitve_korenov_po_zanrih / zanri.groupby('zanr').size() + 0.001
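Two details of this expression are easy to miss: dividing a table by a series matches the series index against the column names, and the division is evaluated before the addition. A minimal sketch with hypothetical counts (the names pojavitve_primer and stevilo_filmov_primer are made up for illustration):

```python
import math
import pandas as pd

# hypothetical counts of roots per genre
pojavitve_primer = pd.DataFrame(
    {'Crime': [2, 0], 'Romance': [0, 3]},
    index=pd.Index(['murder', 'lov'], name='koren'),
)
# hypothetical number of films per genre
stevilo_filmov_primer = pd.Series({'Crime': 4, 'Romance': 6})

# each column is divided by its own genre's film count, then 0.001 is added
verjetnosti_primer = pojavitve_primer / stevilo_filmov_primer + 0.001
print(verjetnosti_primer)
```

Cells with a count of 0 end up at exactly 0.001, so they shrink the product for a genre without wiping it out. The shifted values are no longer exact probabilities: a root that occurred in every film of a genre would get 1.001.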

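Let us look at the most frequent roots for a few genres. The unlabelled entry at the top of each list is the empty root, which koren_besede returns for words such as a or is: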

verjetnosti_korenov_po_zanrih.Crime.sort_values(ascending=False).head(20)
koren
          0.917060
th        0.664270
an        0.639525
t         0.599253
of        0.534721
h         0.463882
in        0.419244
on        0.245056
with      0.230015
for       0.184891
wh        0.177128
by        0.152868
when      0.146560
from      0.132490
after     0.129093
their     0.122786
murder    0.116963
her       0.115508
that      0.103863
him       0.094644
Name: Crime, dtype: float64
verjetnosti_korenov_po_zanrih.Romance.sort_values(ascending=False).head(20)
koren
         0.887836
an       0.645919
th       0.633217
t        0.574903
of       0.466935
in       0.452501
h        0.434025
with     0.326058
her      0.279291
on       0.223864
for      0.212894
lov      0.210007
wh       0.164972
their    0.157467
young    0.156889
woman    0.153425
when     0.141300
lif      0.134372
that     0.133794
man      0.126866
Name: Romance, dtype: float64
verjetnosti_korenov_po_zanrih['Sci-Fi'].sort_values(ascending=False).head(20)
koren
         0.922488
th       0.707612
an       0.662157
t        0.642873
of       0.550587
in       0.434884
h        0.356372
on       0.264085
with     0.215876
for      0.170421
from     0.169044
that     0.164912
by       0.160780
after    0.134609
their    0.134609
earth    0.124967
when     0.123590
$        0.118080
int      0.116702
wh       0.113948
Name: Sci-Fi, dtype: float64

We now determine the genres as follows: for each genre, we multiply the probability of the genre by the conditional probabilities of all roots that appear in the film's description. Roots that do not occur in the table are ignored.

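def doloci_zanre(opis):
    # keep only the rows for roots that occur in the description, multiply them
    # column by column, and weight each product by the genre probability
    faktorji_zanrov = verjetnosti_zanrov * verjetnosti_korenov_po_zanrih[
        verjetnosti_korenov_po_zanrih.index.isin(
            koreni_besed(opis)
        )
    ].prod()
    # rescale so that the most likely genre gets a score of 1
    faktorji_zanrov /= max(faktorji_zanrov)
    return faktorji_zanrov.sort_values(ascending=False).head(5)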
doloci_zanre('Alien space ship appears above Slovenia.')
zanr
Sci-Fi       1.000000
Adventure    0.990078
Action       0.454610
Animation    0.335296
Horror       0.122791
dtype: float64
doloci_zanre('A story about a young mathematician, who discovers her artistic side')
zanr
Drama        1.000000
Biography    0.824923
Romance      0.375775
Musical      0.118636
Comedy       0.099659
dtype: float64