text mining cheatsheet

preprocessing pipeline

import re
import nltk
import gensim
from nltk.corpus import stopwords

# assumed setup: the original uses stop_list and stemmer without defining them
stop_list = stopwords.words('english')
stemmer = nltk.PorterStemmer()

def corpus2docs(corpus):
    # corpus is an NLTK corpus reader
    fids = corpus.fileids()
    docs1 = []
    for fid in fids:
        doc_raw = corpus.raw(fid)
        doc = nltk.word_tokenize(doc_raw)
        docs1.append(doc)
    docs2 = [[w.lower() for w in doc] for doc in docs1]                      # lowercase
    docs3 = [[w for w in doc if re.search('^[a-z]+$', w)] for doc in docs2]  # keep alphabetic tokens
    docs4 = [[w for w in doc if w not in stop_list] for doc in docs3]        # drop stop words
    docs5 = [[stemmer.stem(w) for w in doc] for doc in docs4]                # stem
    return docs5

def docs2vecs(docs, dictionary):
    # docs is a list of documents returned by corpus2docs.
    # dictionary is a gensim.corpora.Dictionary object.
    vecs1 = [dictionary.doc2bow(doc) for doc in docs]   # bag-of-words counts
    tfidf = gensim.models.TfidfModel(vecs1)             # fit IDF weights on the corpus
    vecs2 = [tfidf[vec] for vec in vecs1]               # re-weight each vector by TF-IDF
    return vecs2
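
Typical usage, assuming corpus is an NLTK corpus reader (e.g. PlaintextCorpusReader; the variable names are illustrative):

docs = corpus2docs(corpus)
dictionary = gensim.corpora.Dictionary(docs)   # token <-> integer id mapping
vecs = docs2vecs(docs, dictionary)             # sparse TF-IDF vectors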

POS tags

N (noun): dog, cat, chair
V (verb): read, write, get
ADJ (adjective): pretty, smart, blue
ADV (adverb): gently, carefully, extremely
P (preposition): in, on, by, with, about
PRO (pronoun): I, me, mine, it, they...
CON (conjunction): and, or, but, while, because
INT (interjection): ooh, wow, yeah
DET (determiner): all, his, their
AUX (auxiliary verb): have done, might do
PAR (particle): look up, get on
NUM (numeral): one, two, three
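
In practice, nltk.pos_tag tags tokens with Penn Treebank labels (NN, VB, JJ, ...), which are finer-grained than the generic tags above; a sketch assuming the tagger models have been downloaded:

import nltk

# requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("The dog reads a pretty blue book carefully.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('dog', 'NN'), ('reads', 'VBZ'), ('a', 'DT'), ...]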

Context-free grammar

Grammar = {
    objects: [
        words/tokens: terminals,
        right above: POS tags,
        above: syntactic tags,
        above: sentence
    ];
    rules: [                                     # each rule rewrites X as Y
        X: node name,                            # e.g. "VP" (verb phrase)
        Y: sequence of objects that make up X    # e.g. V + NP
    ]
}
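
Such a grammar can be written out directly in NLTK; a minimal sketch with toy productions (all rules and words here are invented for illustration):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DET N
VP -> V NP
DET -> 'the'
N -> 'dog' | 'book'
V -> 'read'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse(['the', 'dog', 'read', 'the', 'book']):
    print(tree)   # (S (NP (DET the) (N dog)) (VP (V read) (NP (DET the) (N book))))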

Morphemes

stems, affixes (prefix/suffix). Useful for POS tagging and text normalization
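
Stemming approximates the stem morpheme by stripping affixes; a sketch with NLTK's Porter stemmer (the example words are illustrative):

import nltk

stemmer = nltk.PorterStemmer()
print([stemmer.stem(w) for w in ['running', 'cats', 'normalization']])
# typically ['run', 'cat', 'normal']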

Semantics

synonyms: different words, same meaning
polyseme: same word, different meanings
hypernym/hyponym: category >>> specific
meronym/holonym: part >>> whole
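
WordNet (bundled with NLTK) encodes these relations directly; a sketch assuming nltk.download('wordnet') has been run:

from nltk.corpus import wordnet as wn

dog = wn.synsets('dog')[0]   # first sense of 'dog'
print(dog.lemma_names())     # synonyms within this sense
print(dog.hypernyms())       # more general categories
print(dog.part_meronyms())   # named parts, if any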
 

LDA

gibbs sampling
1. random word-to-topic assignment
2. re-assign each word to a topic, one by one, assuming all other assignments are correct

hyperparameters
high $\alpha$ --> documents feature a mixture of most topics
high $\eta$ --> topics feature a mixture of most words

evaluation
coherence (PMI), human eval
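
A minimal training sketch with gensim, assuming vecs and dictionary come from the pipeline above (num_topics and the other settings are illustrative):

import gensim

lda = gensim.models.LdaModel(corpus=vecs, id2word=dictionary,
                             num_topics=10, alpha='auto', eta='auto',
                             passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_topics=5, num_words=5):
    print(topic_id, words)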

Sentiment-Topic Model (Plate Notation)

[plate-notation diagram; not reproduced in this text version]

Cluster Purity

$P_j = \frac{1}{n_j} \max_i n_{ij}$, where $n_{ij}$ = number of items of class $i$ in cluster $j$ and $n_j$ = size of cluster $j$

Overall purity

$\text{purity} = \sum_j \frac{n_j}{n} P_j$ (weighted average over clusters; higher is better)

Cluster Entropy

$E_j = -\sum_i \frac{n_{ij}}{n_j} \log \frac{n_{ij}}{n_j}$ (lower is better; 0 for a pure cluster)

Pointwise Mutual Information

$\text{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}$ (used for topic coherence: words co-occurring more often than chance score high)
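
To make the formulas concrete, a small sketch computing purity and per-cluster entropy on an invented toy clustering:

import math
from collections import Counter

# cluster id -> true class labels of its members (toy data)
clusters = {0: ['crime', 'crime', 'sport'],
            1: ['sport', 'sport']}

n = sum(len(m) for m in clusters.values())
purity = sum(max(Counter(m).values()) for m in clusters.values()) / n
print(purity)   # (2 + 2) / 5 = 0.8

def entropy(members):
    total = len(members)
    return -sum(c/total * math.log2(c/total) for c in Counter(members).values())

print({j: entropy(m) for j, m in clusters.items()})   # cluster 0 mixed, cluster 1 pure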

Discourse Markers

causal: because
consequence: as a result
conditional: if
temporal: when
additive: and
elaboration: [exemplification, re-wording]
contrastive/concessive: but

Preparation for NLTK classifier

# doc_tuple = (doc_representation, label)
# e.g.:
({'police': 1, 'lawyer': 1, 'court': 1}, 'Crime')

# train_set = [doc_tuple1, doc_tuple2, ...]
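
Such tuples feed straight into NLTK's trainers; a sketch with NaiveBayesClassifier on invented toy data:

import nltk

train_set = [
    ({'police': 1, 'lawyer': 1, 'court': 1}, 'Crime'),
    ({'match': 1, 'goal': 1, 'team': 1}, 'Sport'),
]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify({'court': 1, 'judge': 1}))   # 'Crime' expected on this toy data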
           
 
