Brojanje reči

Branimir_Maksimovic

Well-known member
Joined
Nov 22, 2018
Messages
928
Reaction score
370
Elem imamo tekst:
Free Bible Download
Uzeti text verziju.
Zadatak je prebrojati koliko puta se ponavlja koja rec nakon sto se text
pretvori u lowercase i ocisti od interpunkcije.
Prikazati reverzno sortirano po frekvenciji prvih 20 reci.
Mala zanimacija, interesantno je videti koliko moze biti efikasno
i elegantno u isto vreme. Svi jezici dozvoljeni pa i shell ;0)
Za pocetak evo elegantne ekstremno neefikasne implentacije
u Haskell-u:
Code:
~/.../d/bible >>> cat bbl.hs   
import Data.List (sort,group)
import Control.Arrow ((&&&))
import Data.Char (toLower,isPunctuation)
import Text.Printf (printf)

main = do
  contents <- readFile "bible.txt"
  let result = reverse.sort $
                        map (length &&& head) $
                        group.sort.words $
                        map toLower $ filter (not.isPunctuation) contents
  mapM_ (\(x,y) -> printf "%8d %s\n" x y) $ take 20 result
i rezultat:
Code:
  63924 the
  51695 and
  34734 of
  13561 to
  12913 that
  12667 in
  10420 he
  9838 shall
  8997 unto
  8971 for
  8854 i
  8473 his
  8177 a
  7830 lord
  7376 they
  7013 be
  6989 is
  6659 him
  6596 not
  6429 them
Jezici mogu da se razlikuju po tome sta spada u interpunkcijse znake ili ne:)
 
Last edited:

Dragan

Well-known member
Staff member
Joined
Jan 13, 2012
Messages
6,371
Reaction score
65
evo elegantne ekstremno neefikasne implentacije
Koliko traje izvršavanje/brojanje reči?
 
Last edited:

Branimir_Maksimovic

Well-known member
Joined
Nov 22, 2018
Messages
928
Reaction score
370
./bbl 3.79s user 0.12s system 99% cpu 3.938 total

sledi efikasnija verzija od ove …
 
Last edited:

Branimir_Maksimovic

Well-known member
Joined
Nov 22, 2018
Messages
928
Reaction score
370
Sa obzirom da je String u Haskell-u linked lista operacije sa njim su cisto u edukativne svrhe 😛
sledi verzija sa ByteStringom, tj nizom karaktera:
Code:
import Data.List (sort,group)
import Control.Arrow ((&&&))
import Data.Char (toLower,isPunctuation)
import Text.Printf (printf)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8

main = do
  contents <- B.readFile "bible.txt"
  let result = reverse.sort $
                      map (length &&& head) $
                      group.sort.B8.words $
                      B8.map toLower $ B8.filter (not.isPunctuation) contents
  mapM_ (\(x,y) -> printf "%8d %s\n" x (B8.unpack y)) $ take 20 result
ovo traje duplo krace ;p
 
Last edited:

Dragan

Well-known member
Staff member
Joined
Jan 13, 2012
Messages
6,371
Reaction score
65
Običan one-liner awk, sa sve sortiranjem 🙂
awk '{for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]}' bible.txt | sort -k 2 -n
Vreme izvršavanja: 1.24 sekunde

Disclaimer:
NISAM programer 🙂
View attachment 5483
 
Last edited:

Branimir_Maksimovic

Well-known member
Joined
Nov 22, 2018
Messages
928
Reaction score
370
I na kraju verzija sa hashtabelom gde elegancija gubi na svojoj vrednosti 😉

Code:
{-# Language BangPatterns #-}
import qualified Data.HashTable.IO as H
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import Data.List
import Text.Printf
import Data.Char
import Data.IORef

type HashTable k v = H.BasicHashTable k v

wordFreq :: [B.ByteString] -> IO [(B.ByteString, Int)]
wordFreq xs = do
ht <- H.new :: IO (HashTable B.ByteString (IORef Int))
mapM_ (updatefreq ht) xs
lst <- H.toList ht
mapM ((x,y)-> do
v <- readIORef y
return (x,v)) lst
where updatefreq ht word = do
!lu <- H.lookup ht word
case lu of
Nothing -> do
ref <- newIORef 1
H.insert ht word ref
Just x -> modifyIORef’ x (+1)
return ()

main = do
contents <- B.readFile “bible.txt”
result <- wordFreq.B8.words $
B8.map toLower $ B8.filter (not.isPunct) contents
let sorted = reverse.sort $ map ((x,y) -> (y,x)) $ result
mapM_ ((x,y) -> printf “%8d %s\n” x (B8.unpack y)) $ take 20 sorted

isPunct c =
c == ‘’’ || c == ‘.’ || c == ‘;’ || c == ‘(’ || c == ‘)’
|| c == ‘"’ || c == ‘?’ || c == ‘-’ || c == ‘_’ || c == ‘!’
|| c == ‘,’ || c == ‘:’ || c == ‘|’
Code:
  63924 the
  51696 and
  34734 of
  13561 to
  12913 that
  12667 in
  10420 he
  9838 shall
  8997 unto
  8971 for
  8854 i
  8473 his
  8177 a
  7830 lord
  7376 they
  7013 be
  6989 is
  6659 him
  6596 not
  6430 them
./bbl4  0.19s user 0.01s system 97% cpu 0.212 total
Gle to je samo 25 puta brze ;p
 
Last edited:

Dragan

Well-known member
Staff member
Joined
Jan 13, 2012
Messages
6,371
Reaction score
65
A evo i moje skriptice koju koristim da dobijem vreme izvršavanja 🙂

Code:
#!/bin/bash
start=$(date +%s.%N); 
sleep 0.1s; 
./counting.sh; \

time=$(echo “$(date +%s.%N) - $start” | bc); 
printf “Vreme izvršavanja : %.2f sekundi\n” $time

awk sam smestio u posebnu skripticu, counting.sh
 
Last edited:

Branimir_Maksimovic

Well-known member
Joined
Nov 22, 2018
Messages
928
Reaction score
370
Običan one-liner awk, sa sve sortiranjem 🙂
awk '{for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]}' bible.txt | sort -k 2 -n
Vreme izvršavanja: 1.24 sekunde

Disclaimer:
NISAM programer 🙂
View attachment 5483
Nije dobar rezultat, moras da pretvoris u lowercase, izbacis interpunkciju i sortiras reverzno 😉
 
Last edited:

Dragan

Well-known member
Staff member
Joined
Jan 13, 2012
Messages
6,371
Reaction score
65
Ok…boja je kolor, a Boja postoji i kao žensko ime…prvo što mi je palo na pamet 🙂
 
Last edited:

Dragan

Well-known member
Staff member
Joined
Jan 13, 2012
Messages
6,371
Reaction score
65
Osim toga, prošlo je vreme MSDOS-a, fajl sistemi razlikuju mala i velika slova 🙂
 
Last edited:

Dragan

Well-known member
Staff member
Joined
Jan 13, 2012
Messages
6,371
Reaction score
65
@Branimir Maksimovic
Ama, semantika je realnost, zato i pucaju programi koji je zanemaruju 😃
 
Last edited:

Branimir_Maksimovic

Well-known member
Joined
Nov 22, 2018
Messages
928
Reaction score
370
Kad smo pominjali Go, evo isto to u Go-u 😛

Code:
package main

import (
“fmt”
“unicode”
“io/ioutil”
“strings”
“sort”
)

func main() {
buf, err := ioutil.ReadFile(“bible.txt”)
if err != nil {
fmt.Println(err)
return
}
content := make(map[string]int)
filtered := make([]byte,0,len(buf))
for _,v := range buf {
if !IsPunct(v) {
v = byte(unicode.ToLower(rune(v)))
filtered = append(filtered,v)
}
}
fmt.Printf("%d\n%d\n",len(buf),len(filtered))
words := strings.Fields(string(filtered))
for _,v := range words {
content[v]++
}
output := make(PairList,0,len(content))
for k,v := range content {
output = append(output,Pair{v,k})
}
sort.Sort(output)
for i := 0;i<20;i++ {
fmt.Printf("%8d %s\n",output[i].count,output[i].word)
}
}

func IsPunct(c byte)bool {
return c == ‘’’ || c == ‘.’ || c == ‘;’ || c == ‘(’ ||
c == ‘)’ || c == ‘"’ || c == ‘?’ || c == ‘-’ || c == ‘_’ || c == ‘!’ ||
c == ‘,’ || c == ‘:’ || c == ‘|’
}

type Pair struct {
count int
word string
}

type PairList []Pair

func (this PairList) Len()int { return len(this) }
func (this PairList) Swap(i,j int) { this[i],this[j] = this[j],this[i] }
func (this PairList) Less(i,j int)bool { return this[i].count > this[j].count }

~/…/d/bible >>> time ./bbl
4587478
4430654
63924 the
51696 and
34734 of
13561 to
12913 that
12667 in
10420 he
9838 shall
8997 unto
8971 for
8854 i
8473 his
8177 a
7830 lord
7376 they
7013 be
6989 is
6659 him
6596 not
6430 them
./bbl 0.12s user 0.04s system 135% cpu 0.122 total

[/code]

Deluje da je Go efikasniji 😉
 
Last edited:

Dragan

Well-known member
Staff member
Joined
Jan 13, 2012
Messages
6,371
Reaction score
65
Nego…instalirao sam haskell, ali ne mogu da pokrenem program koji si okačio…gde grešim?
Ispljune greške, napravi i neke fajlove, ali nix…
View attachment 5484
 
Last edited:

Branimir_Maksimovic

Well-known member
Joined
Nov 22, 2018
Messages
928
Reaction score
370
Ma to kompajliras sa ghc-om 😉
Ovako:
Code:
~/.../d/bible >>> ghc -O2 bbl.hs  
[1 of 1] Compiling Main  ( bbl.hs, bbl.o )
Linking bbl ...
I onda startujes executable. Taj primer sa hashtable treba da instaliras jos i
cabal update
pa
cabal install hashtables
i
cabal install bytestring
 
Last edited:

Dragan

Well-known member
Staff member
Joined
Jan 13, 2012
Messages
6,371
Reaction score
65
Uf…kakav sam magarac, da previdim da mora da se kompajlira…tnx 🙂
 
Last edited:

Dragan

Well-known member
Staff member
Joined
Jan 13, 2012
Messages
6,371
Reaction score
65
Ne ide kompajliranje tvoje sors skripte kako treba…koju verziju ghc i haskella koristiš?
 
Last edited:
Top