Python Programming Glossary: corpus
Python random.sample with a generator http://stackoverflow.com/questions/12581437/python-random-sample-with-a-generator I am trying to get a random sample from a very large text corpus. The problem is that random.sample raises the following error...
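The error in that question arises because random.sample needs a population with a known length, which a generator lacks. A minimal stdlib-only sketch of one standard workaround, reservoir sampling (Algorithm R); the function name and the sample generator are illustrative, not taken from the question:

```python
import random

def reservoir_sample(iterable, k, seed=None):
    """Pick k items uniformly at random from an iterable of unknown
    length (Algorithm R), without materializing it in memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(iterable):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

# A generator with no len(), like lines streamed from a large corpus.
lines = (f"line {n}" for n in range(1_000_000))
picked = reservoir_sample(lines, 5, seed=42)
print(picked)
```

This streams the corpus once and keeps only k items in memory, which is usually the point when the corpus is too large to hold as a list.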
Python Multiprocessing storing data until further call in each process http://stackoverflow.com/questions/14437944/python-multiprocessing-storing-data-until-further-call-in-each-process code sample: tfidf_vect = ftext.TfidfVectorizer(...); N = 100000; corpus = ['This is the first document.', 'This is the second second document.', ...]; report_memory('Before fit_transform'); X = tfidf_vect.fit_transform(corpus); model = lm.LogisticRegression(); model.fit(X, y); report_memory('After..
How to calculate cosine similarity given 2 sentence strings? - Python http://stackoverflow.com/questions/15173225/how-to-calculate-cosine-similarity-given-2-sentence-strings-python In order to use tf-idf, you need to have a reasonably large corpus from which to estimate tf-idf weights. You can also develop it..
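As the excerpt notes, tf-idf weights need a large corpus to estimate; with only two sentences, a plain term-count cosine similarity is a common fallback. A stdlib-only sketch (the function name and tokenization by whitespace are illustrative assumptions):

```python
import math
from collections import Counter

def cosine_sim(s1, s2):
    """Cosine similarity between two sentences using raw term counts.
    A tf-idf weighting would need a larger corpus to estimate idf."""
    v1 = Counter(s1.lower().split())
    v2 = Counter(s2.lower().split())
    # Dot product over the shared vocabulary only.
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine_sim("this is a test", "this is"))
```

The result is 1.0 for identical bags of words, 0.0 for disjoint vocabularies, and something in between otherwise.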
How to get the wordnet sense frequency of a synset in NLTK? http://stackoverflow.com/questions/15551195/how-to-get-the-wordnet-sense-frequency-of-a-synset-in-nltk According to the documentation I can load a sense-tagged corpus in NLTK as such: from nltk.corpus import wordnet_ic; brown_ic = wordnet_ic.ic('ic-brown.dat'); semcor_ic.. But how can I get the frequency of a synset from a corpus? To break down the question: first, how to count how many times did..
POS tagging in German http://stackoverflow.com/questions/1639855/pos-tagging-in-german You'll need to tell NLTK about some German corpus to help it tokenize German correctly. I believe the EUROPARL corpus might help get you going. See nltk.corpus.europarl.german; this is what you're looking for. Also consider..
Feedparser - retrieve old messages from Google Reader http://stackoverflow.com/questions/1676223/feedparser-retrieve-old-messages-from-google-reader My intent is to do Natural Language Processing over this corpus, and I would like to be able to retrieve many past entries from..
How is it that json serialization is so much faster than yaml serialization in python? http://stackoverflow.com/questions/2451732/how-is-it-that-json-serialization-is-so-much-faster-than-yaml-serialization-in-p orders of magnitude without some profiling data and a big corpus of examples. In any case be sure to test over a large body of..
Iterating through String word at a time in Python http://stackoverflow.com/questions/2768628/iterating-through-string-word-at-a-time-in-python I tried using re module matches, but as I have a huge text corpus that I have to search through, this is taking a large amount of..
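For a huge corpus, the usual fix for that slowness is to iterate lazily rather than building a full match list up front. A minimal sketch using re.finditer, which scans on demand (the generator wrapper and word pattern are illustrative):

```python
import re

def iter_words(text):
    """Yield words one at a time. re.finditer scans lazily, so no
    large intermediate list is built the way re.findall would."""
    for m in re.finditer(r"\w+", text):
        yield m.group(0)

words = iter_words("The quick brown fox")
print(next(words))  # 'The'
```

Because nothing is materialized, memory stays flat no matter how large the input text is, and you can stop early once you find what you need.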
How do I count words in an nltk plaintextcorpus faster? http://stackoverflow.com/questions/3902044/how-do-i-count-words-in-an-nltk-plaintextcorpus-faster I have a set of documents and I want to return a list.. this project done faster: def searchText(searchword): counts = ..; corpus_root = 'some_dir'; wordlists = PlaintextCorpusReader(corpus_root, '.*'); for id in wordlists.fileids(): date = id[4:12]; month = date..
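The typical speed-up for this kind of task is to tally every word in a single pass with collections.Counter, instead of re-scanning each document once per search word. A generic stdlib sketch of the idea, not the question's actual searchText or corpus layout (the function name and sample documents are illustrative):

```python
from collections import Counter

def count_words(texts, searchwords):
    """One Counter pass per document instead of one full scan per
    search word; lookups afterwards are O(1)."""
    totals = Counter()
    for text in texts:
        totals.update(text.lower().split())
    # Missing words come back as 0, since Counter defaults to 0.
    return {w: totals[w] for w in searchwords}

docs = ["the cat sat", "the dog ran the race"]
print(count_words(docs, ["the", "cat", "missing"]))
# {'the': 3, 'cat': 1, 'missing': 0}
```

With an NLTK corpus reader, the same pattern would apply per fileid: build one frequency table per file, then read counts off it for every search word.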
Creating a new corpus with NLTK http://stackoverflow.com/questions/4951751/creating-a-new-corpus-with-nltk I reckoned that often the answer to my title is to.. I have a bunch of .txt files and I want to be able to use the corpus functions that NLTK provides for the corpus nltk_data. I've tried PlaintextCorpusReader but I couldn't get..
custom tagging with nltk http://stackoverflow.com/questions/5919355/custom-tagging-with-nltk
How do I find the frequency count of a word in English using WordNet? http://stackoverflow.com/questions/5928704/how-do-i-find-the-frequency-count-of-a-word-in-english-using-wordnet corpora/wordnet/cntlist.rev. Code example: from nltk.corpus import wordnet; syns = wordnet.synsets('stack'); for s in syns: for.. It is not stated in the source file or in the documentation which corpus was used to create this data. So it's probably best to choose the corpus that fits your application best and create the data yourself..
Fast n-gram calculation http://stackoverflow.com/questions/7591258/fast-n-gram-calculation I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed.. there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK. If so, what can I use to speed things up? python..
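If abandoning NLTK is on the table, a common pure-Python alternative is to build n-grams by zipping shifted slices of the token list, which avoids per-gram function-call overhead. A minimal sketch (the function name and sample tokens are illustrative):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list via zip over n shifted
    views of the list -- plain Python, no NLTK required."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```

zip stops at the shortest slice, so the last incomplete window is dropped automatically, which matches the usual definition of contiguous n-grams.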