Issue
I have created a cron job for my website that runs every 2 hours. It counts the words in the feeds and then displays the 10 most frequent words as the hot topics.
This is similar to what Twitter does on its homepage, where it shows the most popular topics being discussed.
Right now the cron job counts every word except those in a list I have specified, such as:
array('of', 'a', 'an', 'also', 'besides', 'equally', 'further', 'furthermore', 'in', 'addition', 'moreover', 'too',
'after', 'before', 'when', 'while', 'as', 'by', 'the', 'that', 'since', 'until', 'soon', 'once', 'so', 'whenever', 'every', 'first', 'last',
'because', 'even', 'though', 'although', 'whereas', 'while', 'if', 'unless', 'only', 'whether', 'or', 'not', 'even',
'also', 'besides', 'equally', 'further', 'furthermore', 'addition', 'moreover', 'next', 'too',
'likewise', 'moreover', 'however', 'contrary', 'other', 'hand', 'contrast', 'nevertheless', 'brief', 'summary', 'short',
'for', 'example', 'for instance', 'fact', 'finally', 'in brief', 'in conclusion', 'in other words', 'in short', 'in summary', 'therefore',
'accordingly', 'as a result', 'consequently', 'for this reason', 'afterward', 'in the meantime', 'later', 'meanwhile', 'second', 'earlier', 'finally', 'soon', 'still', 'then', 'third'); //words that are negligible
But this does not completely solve the problem of eliminating all the unneeded words and keeping only the useful ones.
Can someone please guide me on this and tell me how I can improve my algorithm?
Solution
Here's how we implemented this for the DjangoDose live feed during DjangoCon. (Note: this is a hack job. We wrote it in one afternoon with no testing, and it would occasionally yell "Bifurcation"; as best I can tell, bifurcation has nothing to do with anything.) All that being said, it more or less worked for us (meaning that in the evenings, beer was tracked appropriately).
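import os
import string
from collections import defaultdict

import simplejson
from django.conf import settings
from django.http import HttpResponse

# LOG_DIRECTORY is assumed to be defined elsewhere in the project.
# ignores.txt holds whitespace-separated words to skip.
with open(os.path.join(settings.ROOT_PATH, 'djangocon', 'ignores.txt')) as f:
    IGNORED_WORDS = set(f.read().split())

def trending_topics(request):
    # Read the four most recent log files; each line is one JSON-encoded tweet.
    logs = sorted(os.listdir(LOG_DIRECTORY), reverse=True)[:4]
    tweets = []
    for log in logs:
        with open(os.path.join(LOG_DIRECTORY, log), 'r') as f:
            for line in f:
                tweets.append(simplejson.loads(line)['text'])
    words = defaultdict(int)
    for text in tweets:
        prev = None
        for word in text.split():
            word = word.strip(string.punctuation).lower()
            if word and word not in IGNORED_WORDS:
                words[word] += 1
                if prev is not None:
                    # Count the two-word phrase and take one count back
                    # from each of its member words.
                    words['%s %s' % (prev, word)] += 1
                    words[prev] -= 1
                    words[word] -= 1
                prev = word
            else:
                # An ignored or empty word breaks the phrase.
                prev = None
    trending = sorted(words.items(), key=lambda o: o[1], reverse=True)[:15]
    if request.user.is_staff:
        # Staff also see the counts.
        trending = ['%s - %s' % (word, count) for word, count in trending]
    else:
        trending = [word for word, count in trending]
    return HttpResponse(simplejson.dumps(trending))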
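Stripped of the Django view and the log-file reading, the counting step above can be sketched as a standalone function, which makes it easier to try on a few strings. This is only an illustration: the name `count_topics`, the sample texts, and the small ignore set are made up for the example, and `collections.Counter` stands in for the `defaultdict(int)` used above.

```python
import string
from collections import Counter


def count_topics(texts, ignored_words):
    """Count single words and adjacent word pairs, skipping ignored words."""
    counts = Counter()
    for text in texts:
        prev = None
        for raw in text.split():
            word = raw.strip(string.punctuation).lower()
            if not word or word in ignored_words:
                # An ignored word or bare punctuation breaks the phrase.
                prev = None
                continue
            counts[word] += 1
            if prev is not None:
                # Credit the pair and take one count back from each member.
                counts[prev + " " + word] += 1
                counts[prev] -= 1
                counts[word] -= 1
            prev = word
    return counts


# Hypothetical sample data, not taken from the real feed.
texts = [
    "Django beer tonight!",
    "More django beer, please.",
    "The beer is cold",
]
ignored = {"the", "is", "more"}

counts = count_topics(texts, ignored)
print(counts.most_common(1))  # [('django beer', 2)]
```

Because each pair takes a count away from its two words, a word that mostly appears inside phrases ends up at or below zero (here `beer` finishes at -1), so the phrase ranks above its parts.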
Answered By - Alex Gaynor Answer Checked By - David Goodson (WPSolving Volunteer)