Monday, May 16, 2022

[SOLVED] How to improve my Algorithm to find the Hot-Topics like twitter does

May 16, 2022 cron, php

Issue

I have created a cron job for my website that runs every 2 hours. It counts the words in the feeds and then displays the 10 most frequent words as the hot topics.

This is similar to what Twitter does on its homepage, where it shows the most popular topics being discussed.

Right now my cron job counts every word except the ones I have listed, words like:

array('of', 'a', 'an', 'also', 'besides', 'equally', 'further', 'furthermore', 'in', 'addition', 'moreover', 'too',
                        'after', 'before', 'when', 'while', 'as', 'by', 'the', 'that', 'since', 'until', 'soon', 'once', 'so', 'whenever', 'every', 'first', 'last',
                        'because', 'even', 'though', 'although', 'whereas', 'while', 'if', 'unless', 'only', 'whether', 'or', 'not', 'even',
                        'also', 'besides', 'equally', 'further', 'furthermore', 'addition', 'moreover', 'next', 'too',
                        'likewise', 'moreover', 'however', 'contrary', 'other', 'hand', 'contrast', 'nevertheless', 'brief', 'summary', 'short',
                        'for', 'example', 'for instance', 'fact', 'finally', 'in brief', 'in conclusion', 'in other words', 'in short', 'in summary', 'therefore',
                        'accordingly', 'as a result', 'consequently', 'for this reason', 'afterward', 'in the meantime', 'later', 'meanwhile', 'second', 'earlier', 'finally', 'soon', 'still', 'then', 'third');       //words that are negligible

But this does not completely solve the problem of eliminating all the unneeded words and keeping only the useful ones.

Can someone please guide me on this and tell me how I can improve my algorithm?
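
One thing worth checking in a list like the one above: entries such as 'for instance' or 'in other words' contain spaces, so if the feed text is split word by word they can never equal a single token and are never filtered. The following is a minimal sketch, written in Python rather than PHP and assuming the text is split on whitespace, of splitting every phrase into its component words before counting. The helper names normalize_stop_words and top_words and the sample feeds are hypothetical, not part of the original cron job.

```python
import string
from collections import Counter


def normalize_stop_words(entries):
    """Split multi-word entries into single lowercase words and drop duplicates."""
    words = set()
    for entry in entries:
        words.update(entry.lower().split())
    return words


def top_words(texts, stop_words, n=10):
    """Count words that are not stop words and return the n most frequent."""
    counts = Counter()
    for text in texts:
        for token in text.split():
            # Strip surrounding punctuation so "coffee," and "coffee!" match.
            word = token.strip(string.punctuation).lower()
            if word and word not in stop_words:
                counts[word] += 1
    return counts.most_common(n)


# Hypothetical sample data, for illustration only.
stop_words = normalize_stop_words(
    ['for instance', 'in other words', 'the', 'of', 'also', 'also'])
feeds = [
    "For instance, the price of coffee rose.",
    "In other words: coffee, coffee, coffee!",
]
hot = top_words(feeds, stop_words, n=2)
print(hot)
```

With the phrases split, 'instance' and 'words' are filtered in this sample, whereas a direct comparison against 'for instance' would have let them through.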

Solution

Here's how we implemented this for the DjangoDose live feed during DjangoCon. Note: this is a hack job. We wrote it in one afternoon with no testing, and it occasionally yells "Bifurcation", even though, as best I can tell, bifurcation has nothing to do with anything. All that being said, it more or less worked for us (meaning that in the evenings beer was tracked appropriately).

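import os
import string
from collections import defaultdict

import simplejson
from django.conf import settings
from django.http import HttpResponse

# LOG_DIRECTORY is assumed to be defined elsewhere in this module.

# ignores.txt holds the words to skip, separated by whitespace.
with open(os.path.join(settings.ROOT_PATH, 'djangocon', 'ignores.txt')) as f:
    IGNORED_WORDS = set(f.read().split())

def trending_topics(request):
    # Take the last four log files by name (the newest, if names sort chronologically).
    logs = sorted(os.listdir(LOG_DIRECTORY), reverse=True)[:4]
    tweets = []
    for log in logs:
        with open(os.path.join(LOG_DIRECTORY, log), 'r') as f:
            # Each line is one JSON-encoded tweet.
            for line in f:
                tweets.append(simplejson.loads(line)['text'])
    words = defaultdict(int)
    for text in tweets:
        prev = None
        for word in text.split():
            word = word.strip(string.punctuation).lower()
            if word and word not in IGNORED_WORDS:
                words[word] += 1
                if prev is not None:
                    # Credit the two-word phrase and debit both single words.
                    words['%s %s' % (prev, word)] += 1
                    words[prev] -= 1
                    words[word] -= 1
                prev = word
            else:
                # An ignored or empty word breaks the phrase.
                prev = None
    trending = sorted(words.items(), key=lambda o: o[1], reverse=True)[:15]
    if request.user.is_staff:
        # Staff also see the counts.
        trending = ['%s - %s' % (word, count) for word, count in trending]
    else:
        trending = [word for word, count in trending]
    return HttpResponse(simplejson.dumps(trending))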
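
The counting loop in the view above can be tried without Django or log files. What follows is a minimal sketch of the same phrase-crediting idea as a plain function. It is not part of the original answer: the function name trending and the sample tweets are hypothetical, chosen only for illustration.

```python
import string
from collections import defaultdict


def trending(texts, ignored_words, n=15):
    """Rank single words and adjacent two-word phrases by adjusted count."""
    counts = defaultdict(int)
    for text in texts:
        prev = None
        for token in text.split():
            word = token.strip(string.punctuation).lower()
            if not word or word in ignored_words:
                # An ignored or empty token breaks the phrase.
                prev = None
                continue
            counts[word] += 1
            if prev is not None:
                # Credit the phrase and debit both of its single words.
                counts[prev + ' ' + word] += 1
                counts[prev] -= 1
                counts[word] -= 1
            prev = word
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical sample data, for illustration only.
sample = ["Free beer at the bar", "free beer tonight", "Beer is good"]
result = trending(sample, ignored_words={'at', 'the', 'is'}, n=3)
print(result)
```

Because both words of a phrase are debited, "free beer" outranks "free" and "beer" in this sample. A side effect of the scheme is that a word in the middle of a three-word run is debited twice, so single-word counts can drop below zero.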

Answered By - Alex Gaynor

Answer Checked By - David Goodson (WPSolving Volunteer)

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
