Naive Bayes classifier
We want to know whether a film's genres can be predicted from its description. This is a classification problem, since we want to classify films into genres, and our task is to write a suitable program, called a classifier.
Preparation
# load the package
import pandas as pd
# load the tables we will work with
filmi = pd.read_csv('podatki/filmi.csv', index_col='id')
osebe = pd.read_csv('podatki/osebe.csv', index_col='id')
vloge = pd.read_csv('podatki/vloge.csv')
zanri = pd.read_csv('podatki/zanri.csv')
Word stemming
To make the problem more manageable, we represent a description only by the set of stems of the words that occur in it.
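def koren_besede(beseda):
    # keep only letters; a token without any letters (e.g. a number) becomes '$'
    beseda = ''.join(znak for znak in beseda if znak.isalpha())
    if not beseda:
        return '$'
    konec = len(beseda) - 1
    # drop one trailing 'd' or 's'
    if beseda[konec] in 'ds':
        konec -= 1
    # then drop all trailing vowels
    while konec >= 0 and beseda[konec] in 'aeiou':
        konec -= 1
    return beseda[:konec + 1]
def koreni_besed(niz):
    # sorted Series of the distinct stems that occur in the string
    return pd.Series(sorted({
        koren_besede(beseda) for beseda in niz.replace('-', ' ').lower().split() if beseda
    }))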
koreni_besed("In 1938, after his father Professor Henry Jones, Sr. goes missing while pursuing the Holy Grail, Indiana Jones finds himself up against Adolf Hitler's Nazis again to stop them obtaining its powers.")
0             $
1         adolf
2         after
3         again
4       against
5        father
6          find
7             g
8         grail
9             h
10        henry
11      himself
12       hitler
13         holy
14           in
15       indian
16           it
17          jon
18      missing
19          naz
20    obtaining
21        power
22    professor
23     pursuing
24           sr
25         stop
26            t
27           th
28         them
29           up
30         whil
dtype: object
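To see where the very short entries in this output come from, it can help to apply koren_besede to single words. The sketch below repeats the function from above so that it runs on its own; the sample words were picked here for illustration.

```python
def koren_besede(beseda):
    # same function as above, repeated so the example is self-contained
    beseda = ''.join(znak for znak in beseda if znak.isalpha())
    if not beseda:
        return '$'
    konec = len(beseda) - 1
    if beseda[konec] in 'ds':
        konec -= 1
    while konec >= 0 and beseda[konec] in 'aeiou':
        konec -= 1
    return beseda[:konec + 1]

# illustrative words, already lower-cased as koreni_besed would do
primeri = ['1938,', 'the', 'goes', "hitler's", 'indiana', 'a']
koreni = {beseda: koren_besede(beseda) for beseda in primeri}
for beseda, koren in koreni.items():
    print(repr(beseda), '->', repr(koren))
```

A token without letters maps to '$', goes shrinks to the single letter g, and a word made up only of vowels, such as a, maps to the empty string.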
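Bayes' theorem
We are therefore interested in the probability that a film has genre \(Ž_i\) given that its description contains the stems \(K_1, \ldots, K_n\), that is,
\[P(Ž_i | K_1 \cap \cdots \cap K_n).\]
To compute it we use Bayes' theorem
\[P(A | B) = \frac{P(B | A) \, P(A)}{P(B)},\]
which is why our classifier is called a Bayes classifier. It gives
\[P(Ž_i | K_1 \cap \cdots \cap K_n) = \frac{P(K_1 \cap \cdots \cap K_n | Ž_i) \, P(Ž_i)}{P(K_1 \cap \cdots \cap K_n)}.\]
We further simplify the task by assuming that occurrences of words are independent of one another. This is not actually true: for example, the word treasure is accompanied by hidden more often than by boring, which is why the classifier is called naive. Under this assumption we have
\[P(K_1 \cap \cdots \cap K_n) = P(K_1) \cdots P(K_n)\]
and likewise
\[P(K_1 \cap \cdots \cap K_n | Ž_i) = P(K_1 | Ž_i) \cdots P(K_n | Ž_i).\]
To a film whose description contains the stems \(K_1, \dots, K_n\) we assign those genres \(Ž_i\) for which this probability is largest. Since the denominator does not depend on the genre, for each \(Ž_i\) we only need to compute the numerator:
\[P(Ž_i) \cdot P(K_1 | Ž_i) \cdots P(K_n | Ž_i).\]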
We know how to compute all of these quantities, so we can get to work.
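As a small numerical illustration of the numerator, the sketch below compares two genres for a description with two stems. All probabilities in it are invented for the example and are not taken from the data set.

```python
import math

# Invented values of P(Ž) for two genres.
verjetnost_zanra = {'Sci-Fi': 0.07, 'Drama': 0.57}
# Invented values of P(K|Ž) for two stems.
verjetnost_korena = {
    'Sci-Fi': {'alien': 0.05, 'ship': 0.03},
    'Drama': {'alien': 0.002, 'ship': 0.01},
}
koreni = ['alien', 'ship']

# numerator P(Ž) * P(K_1|Ž) * ... * P(K_n|Ž) for each genre
stevci = {
    zanr: verjetnost_zanra[zanr]
    * math.prod(verjetnost_korena[zanr][koren] for koren in koreni)
    for zanr in verjetnost_zanra
}
najboljsi = max(stevci, key=stevci.get)
print(stevci)
print(najboljsi)
```

Although Drama has the much larger prior in this toy setting, the two stems favour Sci-Fi strongly enough to outweigh it.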
The probability of an individual genre \(P(Ž)\) is easy to compute:
verjetnosti_zanrov = zanri.groupby('zanr').size() / len(filmi)
verjetnosti_zanrov.sort_values()
zanr
Reality-TV    0.000100
Film-Noir     0.005801
Western       0.011801
Musical       0.013501
Sport         0.020102
War           0.022602
Music         0.027603
History       0.034103
Family        0.047005
Animation     0.047005
Biography     0.067107
Sci-Fi        0.072607
Fantasy       0.074507
Mystery       0.103810
Horror        0.126713
Adventure     0.171017
Thriller      0.171217
Romance       0.173217
Crime         0.206121
Action        0.250625
Comedy        0.366237
Drama         0.568057
dtype: float64
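The probabilities above add up to more than 1 because a film can have several genres. The sketch below repeats the same computation on a tiny invented data set; the column names film and zanr for the genre table are an assumption based on how that table is used later on.

```python
import pandas as pd

# Invented data: four films, each with one or two genres.
filmi = pd.DataFrame(
    {'naslov': ['A', 'B', 'C', 'D']},
    index=pd.Index([1, 2, 3, 4], name='id'),
)
# Assumed layout of the genre table: one row per (film, genre) pair.
zanri = pd.DataFrame({
    'film': [1, 1, 2, 3, 3, 4],
    'zanr': ['Drama', 'War', 'Drama', 'Comedy', 'Drama', 'Comedy'],
})

# share of films that carry each genre
verjetnosti_zanrov = zanri.groupby('zanr').size() / len(filmi)
print(verjetnosti_zanrov.sort_values())
print(verjetnosti_zanrov.sum())
```

Each value is the share of films carrying that genre, so the values need not sum to 1.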
We will store the probabilities \(P(K|Ž)\) in a table whose rows correspond to stems \(K\) and whose columns correspond to genres \(Ž\). First we need to find all films that have genre \(Ž\) and whose description contains the stem \(K\). Let us take all the film descriptions:
filmi.opis
id
4972        The Stoneman family finds its friendship with ...
6864        The story of a poor young woman separated by p...
9968        A frail waif, abused by her brutal boxer fathe...
10323       Hypnotist Dr. Caligari uses a somnambulist, Ce...
12349       The Tramp cares for an abandoned child, but ev...
                                  ...                        
18568902    An indian cricketer who shows persistence and ...
18689424    Batman is manipulated by Lex Luthor to fear Su...
18968540    Six years ago, Li Ronan was cursed after break...
20850406    An orphan soldier, Lieutenant Ram's life chang...
21279138    The Last Days of Ferdinand and Imelda Marcos t...
Name: opis, Length: 9999, dtype: object
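We convert this series of strings into sets of stems with the apply method, which applies a given function to each entry. Because koreni_besed returns a Series, the result is a table with one row per film and one column per position in the sorted list of stems.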
filmi.opis.apply(
    koreni_besed
)
/tmp/ipykernel_2057/601056397.py:1: FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.
  filmi.opis.apply(
| id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4972 | affect | an | arm | assassination | birth | both | by | cameron | civil | development | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6864 | | an | baby | by | from | her | history | husban | interwoven | intoleranc | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9968 | | abus | befriend | boxer | brutal | by | chines | consequenc | district | father | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10323 | | caligar | cesar | commit | dr | hypnotist | murder | somnambulist | t | us | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12349 | abandon | an | but | car | chil | event | for | in | jeopardy | put | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18568902 | | achiev | ag | an | at | believ | career | chapter | cricketer | end | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18689424 | | an | batman | by | clash | cris | dividing | during | existenc | fear | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18968540 | | action | after | ag | breaking | consequenc | curs | daughter | from | her | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 20850406 | | after | an | back | between | blossom | camp | caught | chang | com | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 21279138 | $ | an | day | dynasty | ferdinan | hour | imeld | last | marc | of | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9999 rows × 55 columns
After some searching online and some data wrangling, we arrive at the desired table:
koreni_filmov = filmi.opis.apply(
    koreni_besed
).stack(
).reset_index(
    level='id'
).rename(columns={
    'id': 'film',
    0: 'koren',
})
koreni_filmov
/tmp/ipykernel_2057/488542747.py:1: FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.
  koreni_filmov = filmi.opis.apply(
| | film | koren |
|---|---|---|
| 0 | 4972 | affect | 
| 1 | 4972 | an | 
| 2 | 4972 | arm | 
| 3 | 4972 | assassination | 
| 4 | 4972 | birth | 
| ... | ... | ... | 
| 9 | 21279138 | of | 
| 10 | 21279138 | story | 
| 11 | 21279138 | tell | 
| 12 | 21279138 | th | 
| 13 | 21279138 | untol | 
227443 rows × 2 columns
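The FutureWarning shown with the code above says that returning a Series from the function passed to apply is deprecated. One possible way to build the same kind of long table without relying on that behaviour is to return a plain list and use explode. This is only a sketch on invented descriptions: seznam_korenov is a hypothetical helper that is not part of the original code, and unlike the table above the result gets a fresh index 0, 1, 2, ...

```python
import pandas as pd

def koren_besede(beseda):
    # same stemming function as above, repeated to keep the example self-contained
    beseda = ''.join(znak for znak in beseda if znak.isalpha())
    if not beseda:
        return '$'
    konec = len(beseda) - 1
    if beseda[konec] in 'ds':
        konec -= 1
    while konec >= 0 and beseda[konec] in 'aeiou':
        konec -= 1
    return beseda[:konec + 1]

def seznam_korenov(niz):
    # hypothetical variant of koreni_besed that returns a list instead of a Series
    return sorted({koren_besede(b) for b in niz.replace('-', ' ').lower().split()})

# Invented descriptions standing in for filmi.opis.
opisi = pd.Series({1: 'Alien ships appear', 2: 'A young woman'}, name='opis')
opisi.index.name = 'id'

koreni_filmov = (
    opisi.apply(seznam_korenov)      # Series of lists, one list per film
    .explode()                       # one row per (film, stem)
    .rename('koren')
    .reset_index()                   # move the film id into a column
    .rename(columns={'id': 'film'})
)
print(koreni_filmov)
```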
We merge this table with the table of genres to obtain a table of stems by genre.
koreni_zanrov = pd.merge(
    koreni_filmov,
    zanri
)[['koren', 'zanr']]
koreni_zanrov
| | koren | zanr |
|---|---|---|
| 0 | affect | Drama | 
| 1 | affect | History | 
| 2 | affect | War | 
| 3 | an | Drama | 
| 4 | an | History | 
| ... | ... | ... | 
| 588190 | of | Drama | 
| 588191 | story | Drama | 
| 588192 | tell | Drama | 
| 588193 | th | Drama | 
| 588194 | untol | Drama | 
588195 rows × 2 columns
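The merged table is longer than the table of stems because every stem of a film is paired with every genre of that film. A minimal sketch on invented data, assuming as above that both tables share a column named film, on which pd.merge joins by default:

```python
import pandas as pd

# Invented data: film 1 has two stems and two genres, film 2 has one of each.
koreni_filmov = pd.DataFrame({
    'film': [1, 1, 2],
    'koren': ['alien', 'ship', 'lov'],
})
zanri = pd.DataFrame({
    'film': [1, 1, 2],
    'zanr': ['Sci-Fi', 'Action', 'Romance'],
})

# pd.merge joins on the columns the two tables have in common, here 'film'
koreni_zanrov = pd.merge(koreni_filmov, zanri)[['koren', 'zanr']]
print(koreni_zanrov)
```

Film 1 contributes 2 × 2 = 4 rows and film 2 contributes one, so three stem rows become five stem-genre rows.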
Using the crosstab function, we count how many times each combination occurs.
pojavitve_korenov_po_zanrih = pd.crosstab(koreni_zanrov.koren, koreni_zanrov.zanr)
pojavitve_korenov_po_zanrih
| koren | Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | Film-Noir | ... | Music | Musical | Mystery | Reality-TV | Romance | Sci-Fi | Sport | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 2270 | 1510 | 412 | 530 | 3239 | 1888 | 5086 | 400 | 669 | 55 | ... | 242 | 117 | 982 | 1 | 1536 | 669 | 170 | 1612 | 204 | 102 |
| $ | 233 | 144 | 27 | 159 | 326 | 184 | 676 | 38 | 56 | 2 | ... | 26 | 10 | 89 | 1 | 161 | 85 | 29 | 155 | 46 | 14 | 
| aang | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| aaron | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| aart | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| ángel | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| æon | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 
| çanakkal | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| édith | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| émilien | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
15528 rows × 22 columns
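To make the counting step concrete, here is a small sketch of pd.crosstab on an invented table of stem-genre pairs; the values are made up for illustration.

```python
import pandas as pd

# Invented stem-genre pairs, one row per (film, stem, genre) combination.
koreni_zanrov = pd.DataFrame({
    'koren': ['alien', 'alien', 'ship', 'alien', 'lov'],
    'zanr': ['Sci-Fi', 'Action', 'Sci-Fi', 'Sci-Fi', 'Romance'],
})

# rows are stems, columns are genres, entries count how often each pair occurs
pojavitve = pd.crosstab(koreni_zanrov.koren, koreni_zanrov.zanr)
print(pojavitve)
```

Pairs that never occur get the count 0. Since each film's stems form a set, a count in the real table is the number of films of that genre whose description contains the stem, provided the genre table lists each film-genre pair only once.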
We now obtain the desired probabilities by dividing each column by the number of films of that genre. To avoid a zero probability for stems that never occur with a given genre in our sample, we increase every probability slightly.
verjetnosti_korenov_po_zanrih = pojavitve_korenov_po_zanrih / zanri.groupby('zanr').size() + 0.001
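The division above works column by column because pandas aligns the index of the Series with the column labels of the table. A minimal sketch with invented counts:

```python
import pandas as pd

# Invented counts: rows are stems, columns are genres.
pojavitve = pd.DataFrame(
    {'Crime': [3, 0], 'Romance': [1, 4]},
    index=['murder', 'lov'],
)
# Invented number of films per genre.
stevilo_filmov = pd.Series({'Crime': 10, 'Romance': 8})

# each column is divided by the matching entry of stevilo_filmov,
# then 0.001 is added to every cell
verjetnosti = pojavitve / stevilo_filmov + 0.001
print(verjetnosti)
```

The stem lov never occurs with Crime in this toy table, yet its probability is 0.001 rather than 0, so it cannot force the whole product for Crime to zero.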
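Let us look at the most frequent stems for a few genres. The blank first entry in each listing is the empty stem, which koren_besede returns for words made up only of vowels, such as a.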
verjetnosti_korenov_po_zanrih.Crime.sort_values(ascending=False).head(20)
koren
          0.917060
th        0.664270
an        0.639525
t         0.599253
of        0.534721
h         0.463882
in        0.419244
on        0.245056
with      0.230015
for       0.184891
wh        0.177128
by        0.152868
when      0.146560
from      0.132490
after     0.129093
their     0.122786
murder    0.116963
her       0.115508
that      0.103863
him       0.094644
Name: Crime, dtype: float64
verjetnosti_korenov_po_zanrih.Romance.sort_values(ascending=False).head(20)
koren
         0.887836
an       0.645919
th       0.633217
t        0.574903
of       0.466935
in       0.452501
h        0.434025
with     0.326058
her      0.279291
on       0.223864
for      0.212894
lov      0.210007
wh       0.164972
their    0.157467
young    0.156889
woman    0.153425
when     0.141300
lif      0.134372
that     0.133794
man      0.126866
Name: Romance, dtype: float64
verjetnosti_korenov_po_zanrih['Sci-Fi'].sort_values(ascending=False).head(20)
koren
         0.922488
th       0.707612
an       0.662157
t        0.642873
of       0.550587
in       0.434884
h        0.356372
on       0.264085
with     0.215876
for      0.170421
from     0.169044
that     0.164912
by       0.160780
after    0.134609
their    0.134609
earth    0.124967
when     0.123590
$        0.118080
int      0.116702
wh       0.113948
Name: Sci-Fi, dtype: float64
We now determine the genres by multiplying, for each genre, the probability of the genre by the conditional probabilities of all stems that occur in the film's description.
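def doloci_zanre(opis):
    # keep only the rows of stems that occur in the description,
    # multiply them column by column and weight by the genre probability
    faktorji_zanrov = verjetnosti_zanrov * verjetnosti_korenov_po_zanrih[
        verjetnosti_korenov_po_zanrih.index.isin(
            koreni_besed(opis)
        )
    ].prod()
    # rescale so that the most likely genre gets the score 1
    faktorji_zanrov /= faktorji_zanrov.max()
    return faktorji_zanrov.sort_values(ascending=False).head(5)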
doloci_zanre('Alien space ship appears above Slovenia.')
zanr
Sci-Fi       1.000000
Adventure    0.990078
Action       0.454610
Animation    0.335296
Horror       0.122791
dtype: float64
doloci_zanre('A story about a young mathematician, who discovers her artistic side')
zanr
Drama        1.000000
Biography    0.824923
Romance      0.375775
Musical      0.118636
Comedy       0.099659
dtype: float64