Reddit Comment Summarizer - A LangChain + ChatGPT API Project

Introduction

Hate reading through all the comments on Reddit? Want a quick way to find out what people are discussing?

I used Reddit's API and LangChain to build a Reddit post comment summarizer. This web application lets you input any Reddit post and summarize the content of its comment threads with a click of a button. The goal of this project is to promote information literacy and help users be smart, critical, and unbiased when consuming online information.

Working principle

The comments scraped from Reddit are filtered and formatted into a prompt string, which is then sent to OpenAI's ChatGPT API through the LangChain framework. The prompt contains all the comment content of interest, and GPT produces a summary guided by prompt engineering that I designed. GPT's output is then sent back to the web app and displayed.
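
For illustration, here is a minimal sketch of that round trip using LangChain's chat-model wrapper. The model name, temperature, and the hard-coded prompt string are placeholder assumptions, not the project's exact settings:

from langchain.chat_models import ChatOpenAI

# Assumes OPENAI_API_KEY is set in the environment.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# In the real app this string is built from the scraped comments.
prompt_string = "Summarize this Reddit discussion:\n1 (alice): ...\n1.1 (bob): ..."

summary = llm.predict(prompt_string)  # the summary text is returned to the web app
print(summary)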

Demo


Presentation PPT: final submission.pdf
Python files: Reddit summarizer

The algorithm behind it

First, I use Reddit's API to scrape the comment text, performing a nested search on each comment thread. The algorithm can be set to scrape to a specific depth of comment threads, a specific number of comment threads, or comments from specific commenters.
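
The scraper method below assumes that a praw client has already been created and the target post stored in self.submission. A typical setup looks like this (the credential strings are placeholders):

import praw

# Placeholder credentials, obtained from https://www.reddit.com/prefs/apps
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit-comment-summarizer",
)

# Fetch the submission for whatever post URL the user pastes into the web app
submission = reddit.submission(url="https://www.reddit.com/r/python/comments/abc123/example/")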

Here is the function that scrapes the comments:
def scrape_comments(self):
        dp = DebugPrinter(working=self.debug)
        submission = self.submission
        if submission is None:
            return "No post submission retrieved!"
        self.post_title = submission.title

        # ------scrape post
        terminate = False
        total_token_count = 0
        # token_limit = 7000 # leave room for the ~8000 token limit
        queue = []
        queue_set_pointer = 1 # scrape first layer comments by sets of 10s
        back_up_queue = []
        scraped_comments = []
        # num_comment_layer = [10,10,3,3,3,3,10,10,10,10]
        # max_depth = 7
        depth = 0

        while not terminate:
            dp.dprint(f"depth: {depth}, queue len: {len(queue)}")
            indent = "| "*depth
            next_queue = []

            if len(queue) == 0:
                if len(back_up_queue) > 0:
                    dp.dprint("#\n"*3+"."*10+f"scraping backup queue #{len(back_up_queue)}")
                    queue = back_up_queue
                    back_up_queue = []
                else:
                    depth = 0
                    indent = "| "*depth
                    for i in range((queue_set_pointer-1)*self.num_comment_layer[0],
                                (queue_set_pointer)*self.num_comment_layer[0]):
                        if i >= len(submission.comments):
                            terminate = True
                            break
                        comment = self.MyComment(submission.comments[i], [i+1], depth)
                        queue.append(comment)
                    queue_set_pointer+=1

            for i, que in enumerate(queue):
                # scrape queued comments
                try:
                    body = que.comment.body
                except AttributeError:
                    # skip objects with no body (e.g. deleted or unresolved MoreComments)
                    dp.dprint("")
                    continue

                this_comment = que.comment
                dp.dprint(f"{indent} this comment pos: {que.pos}, total_token_before: {total_token_count}")  
                dp.dprint(f"{indent} comment: {body[:20]}...")
                scraped_comments.append(que)
                token_count = str_token_count(this_comment.body)
                total_token_count+=token_count

                if total_token_count > self.token_limit:
                    terminate = True
                    break
                try:
                    foo = this_comment.replies[0]
                except IndexError:
                    # this comment has no replies to descend into
                    dp.dprint("")
                    continue
                if depth>=self.max_depth:
                    dp.dprint("")
                    continue
               
                # add new comments to queue
                if isinstance(this_comment.replies[0], praw.models.reddit.more.MoreComments):
                    # expand a "load more comments" placeholder into real comments
                    replies = this_comment.replies[0].comments(0)
                else:
                    replies = this_comment.replies
                dp.dprint(f"reply type: {type(this_comment.replies[0])}")
                for j, reply in enumerate(replies):
                    dp.dprint(f"{indent} reply # {j+1}")
                    reply_pos = que.pos.copy()
                    reply_pos.append(j+1)
                    queued_comment = self.MyComment(reply, reply_pos, depth)
                    if j+1 <= self.num_comment_layer[depth+1]:
                        dp.dprint(" *queued")
                        next_queue.append(queued_comment)
                    else:
                        dp.dprint("")
                        back_up_queue.append(queued_comment)
                dp.dprint("")

            queue = next_queue
            depth+=1
        return scraped_comments
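
The token-counting helper str_token_count (used above to stay under the model's context limit) is not shown here. A plausible implementation with tiktoken would be:

import tiktoken

def str_token_count(text: str) -> int:
    """Count GPT tokens in a string (assumed implementation, using tiktoken)."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))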

Then, I filter and format the scraped information before sending it to GPT. The comments can be grouped by author, by index, or by a range of top-level comments.
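
As an example, a hypothetical formatter that lays out the scraped MyComment objects by thread position could look like this (format_discussion is an illustrative name, not the project's actual function):

def format_discussion(scraped_comments):
    # Each MyComment wraps a praw comment plus its thread position,
    # e.g. pos == [1, 2, 1] for reply 1 of reply 2 of top-level comment 1.
    lines = []
    for mc in scraped_comments:
        pos = ".".join(str(p) for p in mc.pos)                    # "1.2.1"
        author = getattr(mc.comment.author, "name", "[deleted]")  # author may be None
        lines.append(f"{pos} ({author}): {mc.comment.body}")
    return "\n".join(lines)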

The Prompt

The scraped comments are inserted into this prompt as a single chunk of text at the {discussion} placeholder. The completed prompt is then sent to GPT to get the summary.

Below is the prompt structure.

template_3 = """Generate an information summary of a Reddit post called "{title}" based on all the comments.
The overview should include a summary and a reflection.

For the summary, write a point-form summary of the discussion.
- Help me get a comprehensive understanding of the key points of discussion.
- When you mention post-specific nouns and words, you should explain clearly what they mean in context.
- Reference the discussion number if possible.

For the reflection, write a reflection after reading the discussion.
- Be a reading mentor for me. Analyze the discussion critically.
- Provide unique and insightful opinions on the discussion. Critique biases and highlight high-quality arguments.

Here are the comments:
{discussion}

Output the following response in markdown format.
- summary:
- reflection:"""
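
To fill the {title} and {discussion} slots and run the model, the template can be wrapped in a LangChain PromptTemplate and executed through an LLMChain. A minimal sketch, assuming the classic LangChain API and a placeholder model name (submission and format_discussion come from the earlier snippets):

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate.from_template(template_3)  # infers {title} and {discussion}
chain = LLMChain(llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0), prompt=prompt)

markdown_summary = chain.run(title=submission.title,
                             discussion=format_discussion(scraped_comments))

The markdown string returned here is what the web app renders for the user.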