I started working on Crystal in 2016. I had just finished making the switch over to Linux, and I had recently discovered how great Python is (especially on Linux). The virtual assistants I had access to were Siri and Ok Google. They were good; all the little bugs had been ironed out since their initial launches, but they were limited in scope. Everything they could do was confined to my phone and the cloud. If I wanted to do something on my computer via voice command, that just wasn't possible. I wanted something expandable, something I could tinker with, something more personalized.
I decided I would do it myself.
First Version (Legacy)
I'll admit, my first iteration of Crystal was really weird. Here's how it worked:
There's this really convenient Python module called SpeechRecognition. It has built-in functionality to just listen, invoking a callback when it recognizes speech. Then you can call another function to recognize the speech in that audio using any cloud service API it supports. That's exactly how Crystal worked, using Google's speech API for Chromium.
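That callback flow can be sketched roughly like this (assuming the SpeechRecognition package and a working microphone; the function names other than the library's own are mine):

```python
def handle_phrase(recognizer, audio):
    """Called by the background listener whenever a phrase is detected."""
    try:
        # Same free Google speech API the SpeechRecognition docs describe.
        text = recognizer.recognize_google(audio)
        print("Heard:", text)
    except Exception:
        # In practice: sr.UnknownValueError (no speech) or sr.RequestError.
        pass

def listen_forever():
    """Run the background listener.

    Requires a microphone plus `pip install SpeechRecognition pyaudio`,
    so the import is kept local to this function.
    """
    import speech_recognition as sr

    r = sr.Recognizer()
    mic = sr.Microphone()
    with mic as source:
        r.adjust_for_ambient_noise(source)  # calibrate the energy threshold
    # listen_in_background returns a function that stops the listener.
    stop = r.listen_in_background(mic, handle_phrase)
    input("Listening; press Enter to stop.\n")
    stop(wait_for_stop=False)
```

The convenient part is that `listen_in_background` does the recording and phrase detection on its own thread, so the rest of the program only ever sees finished phrases.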
Once we have the text, we need to figure out what the user wants to do. I solved this by using another API, this time IBM Watson's text classifier. (My machine learning knowledge was non-existent at this point.) I created a big list of anticipated queries and mapped each one to a category. It worked pretty well.
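The training data for a classifier like that is just (query, category) pairs. The categories and phrasings below are illustrative examples of mine, not Crystal's actual list:

```python
# Each anticipated query is mapped to an intent category.
# Illustrative only; the real list was much longer.
TRAINING_QUERIES = [
    ("switch to workspace three", "workspace-switch"),
    ("go to the next workspace",  "workspace-switch"),
    ("pause the music",           "media-playback"),
    ("skip this song",            "media-playback"),
    ("what does ephemeral mean",  "define-word"),
    ("define serendipity",        "define-word"),
]

# Pairs like these get uploaded to the cloud classifier, which then
# returns the best-matching category for any new query.
```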
After that, we need to extract parameters from the user's query (if necessary). For example, if we are switching workspaces, we need to know which one to switch to. I used another Python module for NLP called spaCy. It's a very high quality, professional library that I would highly recommend to anybody who is doing any language processing.
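For the workspace example, parameter extraction boils down to pulling a target out of the recognized text. spaCy does this far more robustly with real tokenization and parsing; purely to illustrate the shape of the problem, a hand-rolled extractor might look like:

```python
# Map spoken number words to workspace indices.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

def extract_workspace(query: str):
    """Return the workspace number mentioned in the query, or None."""
    for token in query.lower().split():
        if token in NUMBER_WORDS:
            return NUMBER_WORDS[token]
        if token.isdigit():
            return int(token)
    return None

print(extract_workspace("switch to workspace three"))  # → 3
print(extract_workspace("go to workspace 7"))          # → 7
```

This is exactly the kind of per-command hand parsing that becomes tedious to maintain, which is why a trainable extractor is so appealing.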
Once we know the user's intent and the parameters, we perform the requested action.
1. Response Times
From the beginning, response times were pretty bad, and sometimes Crystal would not respond at all. I was able to trace this back to how the SpeechRecognition module handles recording audio. To fix that, I would need to replace the module with my own solution.
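The core of any replacement recorder is endpointing: deciding when a phrase has started and when trailing silence means it has ended. A toy sketch of energy-based endpointing (the threshold and frame counts are made-up illustrative values; real audio frames would come from something like PyAudio):

```python
import math

def rms(frame):
    """Root-mean-square energy of a frame of audio samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def record_phrase(frames, threshold=500.0, max_silent_frames=3):
    """Collect frames from the start of speech until sustained silence.

    `frames` is any iterable of sample lists; recording stops once more
    than `max_silent_frames` consecutive quiet frames follow speech.
    """
    captured, silent, started = [], 0, False
    for frame in frames:
        loud = rms(frame) >= threshold
        if loud:
            started, silent = True, 0
        elif started:
            silent += 1
            if silent > max_silent_frames:
                break  # enough trailing silence: the phrase has ended
        if started:
            captured.append(frame)
    return captured
```

Tuning these thresholds yourself is what lets you control the latency that a generic library has to be conservative about.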
My text classification solution also contributed to the response times, since it required a web request for every classification, but that added milliseconds rather than seconds.
2. Manual Parsing
My current methods for parsing are very manual. For every new command I add, I end up doing some parsing by hand. Ideally, I could train something to determine and extract parameters. It turns out this is an active field of research, so it's probably way out of my league (but that won't stop me from trying).
3. Poor Code Quality
This version of Crystal was an absolute mess. Originally, I intended for Crystal to be a relatively small project: I was just going to program a few voice commands, like switching workspaces and controlling media playback. But I wanted it to do more, so I kept adding stuff, like finding the definitions of words. Eventually it got super bloated, and response times were already abysmal. My main file alone was over 2,500 lines that could easily have been split across multiple files if I had structured it that way from the start.
I came to the conclusion that some major refactoring was necessary.
This is the only project I have ever been motivated enough to successfully rewrite from the ground up (because it's so god damn cool).
TODO: add diagram image
Crystal is now made of modules. There are 3 types of modules:
- Input modules
- Action modules
- Feedback modules
All audio input from the user is handled by the Input modules. They take the audio, recognize any speech, and turn it into text. Crystal then uses scikit-learn to classify the query and passes it to its respective Action module. Each Action module handles one query classification. Feedback modules handle any kind of response or indicator shown to the user, and they can hook in anywhere on the request/response pipeline.
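Moving classification from a cloud API to scikit-learn means the whole step runs locally. A minimal sketch of what such a classifier could look like (the queries and intent labels are my illustrative examples, and I don't know which scikit-learn estimator Crystal actually uses):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; a real module would train on far more.
queries = [
    "switch to workspace three", "go to the next workspace",
    "pause the music", "skip this song",
    "what does ephemeral mean", "define serendipity",
]
intents = [
    "workspace-switch", "workspace-switch",
    "media-playback", "media-playback",
    "define-word", "define-word",
]

# Bag-of-words features plus a linear classifier: fast enough to run
# locally on every query, with no web request in the loop.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(queries, intents)

print(classifier.predict(["pause the song"])[0])
```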
Input modules are the first step in the pipeline. They handle recording audio, or grabbing audio from somewhere else, and then recognize speech in it. If there is any recognized text, it gets passed down the pipeline.
After Crystal classifies the query, the text gets passed to its respective Action module. Action modules don't handle displaying any kind of result or visual feedback to the user; they just parse the query, then perform their action or request additional parameters from the user.
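That one-module-per-classification contract can be sketched as a small base class plus a dispatch table (all names here are my own, not Crystal's actual code):

```python
class ActionModule:
    """Handles exactly one query classification."""
    intent = None  # set by each subclass

    def run(self, query: str) -> str:
        raise NotImplementedError

class WorkspaceModule(ActionModule):
    intent = "workspace-switch"

    def run(self, query):
        # A real module would extract the target workspace and
        # tell the window manager to switch to it.
        return f"switching workspace for: {query}"

# Dispatch: one Action module per intent label.
ACTIONS = {m.intent: m for m in [WorkspaceModule()]}

def dispatch(intent, query):
    return ACTIONS[intent].run(query)

print(dispatch("workspace-switch", "switch to workspace three"))
```

Adding a new command then means adding one new subclass instead of growing a single giant file.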
Here are the modules that I've refined to the point where they usually don't break:
These modules are either WIP or planned:
As previously mentioned, Feedback modules can hook in anywhere along the request/response pipeline. Currently, there are only a couple of Feedback modules: one sends commands to a blink(1) to change its color when Crystal's status changes, and another sends system notifications with query responses.
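The hook mechanism itself is essentially the observer pattern: the pipeline broadcasts status changes, and every registered Feedback module reacts. A sketch under assumed names (the recording module here is a stand-in for the blink(1) one, logging statuses instead of changing an LED's color):

```python
class FeedbackModule:
    """Gets notified at each stage of the request/response pipeline."""
    def on_status(self, status: str) -> None:
        pass

class LogFeedback(FeedbackModule):
    """Stand-in for a hardware indicator: records each status change."""
    def __init__(self):
        self.seen = []

    def on_status(self, status):
        self.seen.append(status)

FEEDBACK_MODULES = [LogFeedback()]

def set_status(status):
    # The pipeline calls this at each stage; every hooked-in
    # Feedback module receives the event.
    for module in FEEDBACK_MODULES:
        module.on_status(status)

set_status("listening")
set_status("thinking")
set_status("responding")
print(FEEDBACK_MODULES[0].seen)  # → ['listening', 'thinking', 'responding']
```

Because modules only see a status string, a blink(1) LED, a system notification, or anything else can subscribe without the pipeline knowing the difference.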
Current State of the Project
This project is in active development. It is not yet open source, but source code is available on request. The scope of this project is limited to my personal use, meaning I neither intend nor expect anybody else to use it on a daily basis.
In addition to more Feedback and Action modules, I'm also working on making this project work fully offline. No internet connection required. I found this handy little project called cheetah that I hope will help make this happen.