Thought2Text

Let's make it possible to go from thoughts -> text. Yeah, brain reading shit.

We can begin this work by predicting what words people are listening to. We can use a similar approach to "Semantic reconstruction of continuous language from non-invasive brain recordings" (Tang et al.), but instead of fMRI we use low-cost, non-invasive EEG. This has the added benefit of being much more accessible, and hopefully more insightful, since it will be much lower latency (although noisier).

Here's the basic pipeline:

We start by collecting EEG recordings of one person while they watch several videos. We need time-synced captions for these videos. Let's go for a long time period: I'm thinking 10 hours of recordings.
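
Here's a rough sketch of how the caption/EEG pairing could work, assuming the captions export as (start, end, text) tuples and the raw EEG as a channels × samples array. The function name and the 128 Hz sample rate are placeholders, not decisions:

```python
def segment_eeg(eeg, captions, sample_rate=128):
    """Pair each caption with the EEG samples recorded while it was on screen.

    eeg:      (n_channels, n_samples) array from the headset
    captions: list of (start_sec, end_sec, text) tuples from the subtitle file
    Returns a list of (eeg_window, text) training pairs.
    """
    pairs = []
    for start, end, text in captions:
        a = int(start * sample_rate)
        b = int(end * sample_rate)
        if b <= eeg.shape[1]:  # drop captions that run past the end of the recording
            pairs.append((eeg[:, a:b], text))
    return pairs
```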

Next, we train an encoder-decoder model (a transformer encodes the text, a diffusion model decodes to the EEG signal) to predict an EEG recording given a word sequence. This is the most important piece, and the only model we actually train ourselves. It has to be good! Which is why we collect a lot of training data.
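
Here's a minimal sketch of what that model could look like in PyTorch, assuming fixed-length EEG windows. The diffusion side is boiled down to a conditional denoiser that predicts the added noise; a real version would follow WaveGrad/DiffWave and also embed the noise level/timestep and run a proper noise schedule. All class and layer names are placeholders:

```python
import torch
import torch.nn as nn

class TextToEEG(nn.Module):
    """Sketch: a transformer encodes the word sequence, and a conditional
    denoiser (the core of a DDPM-style diffusion decoder) predicts the noise
    added to an EEG window, conditioned on the text encoding."""

    def __init__(self, vocab_size, n_channels, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Denoiser: takes a noisy EEG window plus the mean-pooled text encoding
        # and predicts the noise. A real model would also embed the timestep.
        self.denoiser = nn.Sequential(
            nn.Linear(n_channels + d_model, 512),
            nn.ReLU(),
            nn.Linear(512, n_channels),
        )

    def forward(self, tokens, noisy_eeg):
        # tokens:    (batch, seq_len) word ids
        # noisy_eeg: (batch, time, n_channels) EEG window with noise added
        text = self.encoder(self.embed(tokens)).mean(dim=1)          # (batch, d_model)
        cond = text.unsqueeze(1).expand(-1, noisy_eeg.shape[1], -1)  # broadcast over time
        return self.denoiser(torch.cat([noisy_eeg, cond], dim=-1))   # predicted noise
```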

Now, we use language models! They're pretty good at completing sequences. The plan is to use LMs to generate candidate word sequences, predict EEG recordings for each candidate, and compare those predictions to the actual recording to find the most similar one. And we keep doing this... there, we might have thought2text!
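
Here's a sketch of one step of that search, assuming a HuggingFace-style causal LM with .generate() and a text_to_eeg wrapper that runs the full reverse-diffusion sampling and returns a predicted EEG window of the same size as the observed one. The cosine similarity over flattened windows is just one plausible choice of scoring metric:

```python
import torch
import torch.nn.functional as F

def decode_step(lm, text_to_eeg, prefix, observed_eeg, n_candidates=16):
    """One step of the search: sample candidate continuations from a language
    model, predict the EEG each would produce, and keep the candidate whose
    prediction best matches what was actually recorded.

    lm:           a HuggingFace-style causal LM (has .generate())
    text_to_eeg:  trained text -> EEG model; assumed to take a (1, seq_len)
                  token tensor and return a predicted EEG window
    prefix:       (1, seq_len) token ids decoded so far
    observed_eeg: the real recording for this window (same total size as the
                  model's predicted window)
    """
    candidates = lm.generate(
        prefix,
        do_sample=True,
        num_return_sequences=n_candidates,
        max_new_tokens=10,
    )
    scores = []
    for cand in candidates:
        predicted = text_to_eeg(cand.unsqueeze(0))
        score = F.cosine_similarity(
            predicted.flatten(), observed_eeg.flatten(), dim=0
        )
        scores.append(score)
    best = torch.stack(scores).argmax()
    return candidates[best].unsqueeze(0)
```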

Some things to think about, in no particular order:

  • How do we kickstart the LM generation?
  • Do we predict a fixed-length window of EEG? How do we know how long the EEG recording we're comparing our predictions against should be?
  • Maybe we should use a sliding window of time to generate better predictions. E.g. generate a prediction for the first 10 s, then, once another second of EEG comes in, use it plus the previous 9 seconds to generate a new word sequence. Score all options and keep the best (see the sketch after this list).
  • If this works... we can do this for actual imagined speech! We can create all kinds of new interfaces controlled by the brain.
  • We'll be using a $200 EEG recording device. That's mad cheap. The reason I'm optimistic about this working is that we'll treat the EEG data over long timeframes, as non-continuous data: a 10 s recording paired with a sentence, versus sampling at 100 Hz and trying to make predictions at every step. This should hopefully give us more signal than noise. The device records from the frontal and temporal lobes, which correlate fairly well with language processing in the brain too!
  • Should we start by doing this only for one person? Or for multiple? Should we try to train a neural net that generalizes over several brains? Tang et al.'s paper seems to suggest that doesn't work out too well, so I think we should train separately for every individual.
  • From Gautam: use an encoder-decoder model, transformer encoder + diffusion decoder. Look at WaveGrad and DiffWave.
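
On the sliding-window idea above, a sketch assuming fixed 10 s windows advanced one second at a time, reusing decode_step() from the earlier sketch; initial_prompt() is hypothetical and stands in for however we decide to kickstart the LM:

```python
def sliding_window_decode(lm, text_to_eeg, eeg_stream, window_sec=10,
                          sample_rate=128, step_sec=1):
    """Sliding-window variant of the search: every second, re-score candidate
    word sequences against the most recent 10 s of EEG, keeping the running
    best hypothesis as the prefix for the next step.

    eeg_stream: (n_channels, n_samples) array of the recording so far
    """
    window = window_sec * sample_rate
    step = step_sec * sample_rate
    prefix = initial_prompt()  # hypothetical: however we kickstart the LM
    for start in range(0, eeg_stream.shape[1] - window + 1, step):
        observed = eeg_stream[:, start:start + window]  # latest 10 s window
        prefix = decode_step(lm, text_to_eeg, prefix, observed)
    return prefix
```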