
Artificially Intelligent Pastor

Developing a language model from transcripts of live-streamed messages using TensorFlow.

The words in bold indicate output from the AI Pastor; the views and beliefs expressed are not those of the author or any organization.

 

“Then he told them that Jesus spoke to a deserted place and his companions hunted for five eggs. For they knew everybody, because the story gave us what God has lost.”

Searching for a large, unique dataset I had access to, I realized I could download subtitles from videos uploaded or streamed to YouTube. As AV Coordinator at Providence United Methodist Church, I had set up the church’s live streaming two years earlier. The AI Pastor, a multi-layer recurrent neural network (RNN), is trained on about 100 scriptures and sermons delivered between 2017 and 2019.

After individually downloading each text file from the YouTube system, I wrote a simple Python script to strip the .sbv text files of time-codes and empty lines, then concatenate everything into a single output.txt. Once you have that one very long text file, it is straightforward to run it through TensorFlow in a Python environment following Sung Kim’s tinyShakespeare example, and to take advantage of TensorFlow’s NVIDIA GPU support for faster training.
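Under a character-level setup like the tinyShakespeare example, the model never sees words, only a stream of integer-encoded characters. A minimal sketch of that encoding step (the variable names and sample text here are illustrative, not taken from the actual training code):

```python
# Build a character-level vocabulary from the cleaned transcript and
# encode the text as integers -- the form a char-RNN trains on.
text = "give us the lord in a prayer"   # stand-in for output.txt contents

chars = sorted(set(text))                       # unique characters = vocabulary
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

encoded = [char_to_idx[c] for c in text]        # integer sequence fed to the RNN

# Training pairs: each input window predicts the next character.
seq_length = 8
inputs  = [encoded[i:i + seq_length] for i in range(len(encoded) - seq_length)]
targets = [encoded[i + 1:i + seq_length + 1] for i in range(len(encoded) - seq_length)]
```

The target sequence is just the input shifted one character to the right, which is why the model can only ever be as coherent as the raw transcript it reads.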

 

“What I love about this scripture, if there's an attitude of this piece of 29, what I offer is a word of God before your servant and before those who curse us. Our prayer is dynamic. Give us the Lord in a prayer.”

Unfortunately, you get out what you put into the RNN, and the automatically generated subtitles contain no formatting. Only certain proper nouns arrive capitalized, so for readability I punctuated the outputs myself. Once punctuation is added, a word processor can supply further punctuation and capital letters, and suggest and fix grammar. I kept the changes as close to the original output as possible, while loosely maintaining thoughts and sentence structure.

“The Lord jumped down to him but son of David they were much. Humpty Dumpty found a woman's suit, a choice we're only in once - so she prayed. The Samaritan year was still one last time, and she was convinced that it was still saying it's a way that he can call. Because repentance in the book of which they worship, Paul, and John 3, and the slave, used them to understand seven compassion and turn in to control upon Peter.”

Training improved after I tweaked some settings in search of more coherent output. The input dataset never changed, but the most recent model makes more creative use of the available vocabulary while still leaning on common connecting words and phrases.
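One standard knob for models like this is the sampling temperature (I can't confirm it was among the settings tweaked here, so treat this as a general aside): dividing the output logits by a temperature below 1 makes generation conservative and repetitive, while values above 1 push it toward rarer vocabulary. A NumPy sketch:

```python
import numpy as np

def sample_char(logits, temperature=1.0, rng=None):
    """Sample a character index from RNN output logits.

    Lower temperature sharpens the distribution (safer, repetitive text);
    higher temperature flattens it (more adventurous vocabulary use).
    """
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                       # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)
```

At generation time this function would be called once per character, feeding each sampled character back in as the next input.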

Preprocessing is a step I need to revisit. If I could batch-format the subtitles before they go into the RNN, the model could potentially learn sentence structure. That knowledge, however, would only ever be as good as the word processor used to create the formatting.
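As a rough sketch of what such a batch step could look like (a crude heuristic, not the eventual solution): split the text on sentence-ending punctuation and capitalize each piece, so the model at least sees sentence boundaries. This assumes the punctuation already exists, which is exactly the part a word processor would still have to supply.

```python
import re

def rough_format(text):
    """Crudely restore sentence casing in subtitle text.

    Capitalizes the first letter after each ., ! or ? so a character-level
    model can learn where sentences begin. A real pipeline would also need
    to insert the punctuation, which this sketch takes as given.
    """
    # The capture group keeps the punctuation-plus-space separators in
    # the split result so the text can be reassembled unchanged.
    pieces = re.split(r'([.!?]\s+)', text)
    return ''.join(p[:1].upper() + p[1:] if p else p for p in pieces)
```

Run over a whole directory of cleaned transcripts, a pass like this would replace the manual word-processor step for capitalization, though not for grammar.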

 
 
 

subtitleScript.py

"""
This python script is designed to remove empty lines, as well as time-stamp lines from a directory of
.sbv files generated automatically from YouTube/Google speech to text. Then concatenate the files into one .txt file for training.
@author: Tyler Griggs
"""

import glob
import os
import re

# .sbv time-stamp lines look like "0:00:00.000,0:00:05.000"
TIMESTAMP = re.compile(r'\d+:\d{2}:\d{2}\.\d{3},\d+:\d{2}:\d{2}\.\d{3}')

def format_subtitles(filename):
    """Strip time-stamp lines and empty lines from one .sbv file, in place."""
    if not os.path.isfile(filename):
        print("{} does not exist".format(filename))
        return
    with open(filename) as filehandle:
        lines = filehandle.readlines()

    with open(filename, 'w') as filehandle:
        # Match the time-code pattern specifically, so subtitle text that
        # happens to contain a ":" (e.g. a scripture reference) is kept.
        lines = [x for x in lines if x.strip() and not TIMESTAMP.match(x)]
        filehandle.writelines(lines)


path = os.path.dirname(os.path.abspath(__file__))
txtlist = glob.glob(os.path.join(path, '*.sbv'))  # portable path join instead of '\*.sbv'

for file in txtlist:
    format_subtitles(file)

with open('output.txt', 'w') as outfile:
    for file in txtlist:
        with open(file) as infile:
            outfile.write(infile.read())