
Making Transformers Sing - with Mikey Shulman of Suno

Giving computers a voice has always been at the center of sci-fi movies; “I’m sorry Dave, I’m afraid I can’t do that” wouldn’t hit as hard if it just appeared on screen as terminal output, after all. But once you have a computer that can communicate via voice, what comes next? Singing🎶 of course!

Full show notes: https://www.latent.space/p/suno

Chapters
00:00:00 Introduction
00:01:55 State of Music Generation Models
00:07:40 AI Data Wars & Copyright
00:11:56 Going from ML in finance to music generation
00:14:19 Suno's TTS origins with Bark and Parakeet
00:18:42 Easy vs Expert mode for music
00:24:53 The Midjourney of Music?
00:26:47 Live demo
00:40:19 Remaking vs Creating
00:43:00 Suno's direction
00:46:58 Beyond single track generation
00:49:11 Favorite Suno usage in the wild
00:51:23 The 2 mins overview of the audio generation space
00:54:42 Benchmarking AI

Latent Space

4 days ago

[Music] Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-residence at Decibel Partners, and I'm joined by my co-host swyx, founder of Smol AI.

Hey, and today we are in the remote studio with Mikey Shulman. Welcome.

Thank you, it's great to be here.

I like to go over people's backgrounds on LinkedIn and then find a little bit more outside of it. You did your bachelor's in physics, and then a PhD in physics as well, before going into Kensho Technologies, the home of a lot of top AI startups it seems, where you were head of machine learning for seven years. You're also a lecturer at MIT, which we can talk about. And then about two years ago you left to start Suno, which recently burst onto the scene as one of the top music generation startups. So we can go over that bio, but also: what's not on your LinkedIn that people should know about you?

I love music. I am an aspiring mediocre musician; I wish I were better, but that doesn't make me not enjoy playing real music. And I also love coffee. I'm probably way too much into coffee.

Are you one of those people from the TikToks who use 50 tools to grind the beans and then brush them and then spray them? What level are we talking about here?

I confess there's a spray bottle for beans in the next room, and there is one of those weird comb tools, so: guilty. I don't put it on TikTok, though.

Some things gotta stay private. What do you play?

I played a lot of piano growing up, I play bass, and in a very mediocre way I play guitar and drums.

That's a lot; I cannot do any of those things. So, as Sean mentioned, you guys burst onto the scene as maybe the state-of-the-art music generation company.
It's a model we haven't really covered in the past, so I would love for you to give a brief intro of how you do music generation, and why it's possible. People understand that for text you predict the next word, and that a diffusion model basically adds noise to an image and then removes the noise. But for music it's hard to have a mental model: what does a music model do to generate a song? Maybe we can start there.

Maybe I'll take one more step back and say it's not even entirely worked out yet, the way it is in text; it's an evolving field. Taking a giant step back, audio has been lagging images and text for a while. Very roughly, you can think of audio as one to two years behind images and text, so you have to think of today as being where text was in 2022 or so: the Transformer had been invented, it looked like it worked, but it was far, far less established. So I'll give you the way we think about the world now, with the big caveat that I'm probably wrong if we look back a couple of years from now.

The biggest thing is that you see both Transformer-based and diffusion-based models for audio, in a way that isn't true in text. I know people will do some diffusion for text, but nobody's really doing it for real. We prefer Transformers for a variety of reasons, and then it's very similar to text: you have some abstract notion of a token, and you train a model to predict the probability distribution over the next token. So it's a language model. A language model is just anything that assigns likelihoods to sequences of tokens; sometimes those tokens correspond to text, and in our case they correspond to music, or audio in general. We've learned a lot from our friends in the text domain, the pioneers of this, about how well these Transformer models work, where they work and where they don't. But at its core, the way we like to do things with Transformers is exactly how it works in text: let me predict the next tiny little bit of audio, and I can just keep doing that, generating audio for as long as I want.
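To make that "audio as a language model" idea concrete, here is a minimal sketch in the spirit of what Mikey describes: sample one discrete audio token at a time, exactly like next-word prediction in text. The tiny model, codebook size, and prompt below are illustrative assumptions, not Suno's architecture.

```python
import torch
import torch.nn as nn

VOCAB = 1024  # size of the discrete audio codebook (assumed)

class TinyAudioLM(nn.Module):
    """Toy causal Transformer over audio tokens; a stand-in, not Suno's model."""
    def __init__(self, vocab=VOCAB, dim=128, heads=4, layers=2, max_len=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                       # tokens: (batch, time)
        t = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        return self.head(self.blocks(x, mask=causal))  # (batch, time, vocab)

@torch.no_grad()
def generate(model, prompt, n_new, temperature=1.0):
    """Append n_new sampled audio tokens one at a time (no KV cache; a sketch)."""
    tokens = prompt
    for _ in range(n_new):
        logits = model(tokens)[:, -1] / temperature     # next-token distribution
        nxt = torch.multinomial(logits.softmax(-1), 1)  # sample one token
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens  # a paired codec decoder would turn these back into a waveform

model = TinyAudioLM()
prompt = torch.randint(0, VOCAB, (1, 16))       # e.g. tokens from a style/text prefix
audio_tokens = generate(model, prompt, n_new=100)
print(audio_tokens.shape)                       # torch.Size([1, 116])
```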
The temptation here is always to try to bake in some specialized knowledge about music or audio. You'd obviously get an improvement in your output if you said: here's a set of tokens that only does jazz, or only does voices. How general do you make it versus how specific?

We've always tried to do things, quote unquote, the right way, which means that at the beginning things are going to be hard and worse than other ways, but it means baking in as little implicit knowledge as possible. The same way you don't program into GPT that this is a noun and this is a verb (it has implicitly learned all of those things; I've never seen GPT accidentally put a noun where it meant to put an article in English), we try not to impose anything about music, or audio in general, onto the model, and we let the models learn things by themselves. I think it's beginning to pay off, but it wasn't necessarily obvious from the beginning that it was the right thing to do. For example, take text-to-speech: people will do all sorts of things, like programming in phonemes as the basis for what you do, and that limits you to the set of things that are expressible by phonemes. That works really well in the short term; in the long term it can be quite limiting.
So our approach has always been to do this in its full generality, as end-to-end as we can, even if it means being a little bit worse in the short term. We have a lot of confidence that in the long term that will be the right way to do it.

And what's the data recipe for training a good music model? What percentage of each genre do you use, and do you split vocals and instrumentals?

You have to do lots of things, and this is the biggest area where we have our secret sauce. To a large extent, we benefit from all the beautiful things people do with Transformers and text, and we focus very hard on how to tokenize audio in the right way. Without divulging too much secret sauce: it's at least similar to how it's done in the open-source stuff. You have different models that learn to encode audio into discrete representations, and a lot of this boils down to figuring out the right implicit biases to put into those models and the right data to inject: how do I make sure I can produce all audio arbitrarily, whether that's speech, background music, or vocals, so that I can really capture all the behavior I want?
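As a rough illustration of that "encode audio into discrete representations" step, here is a toy tokenizer in the spirit of open-source neural codecs (EnCodec is a common example of the open-source approach he alludes to). A real codec learns the encoder and codebook end to end, often with residual vector quantization; both are random stand-ins here, just to show the shape of the interface.

```python
import torch

CODEBOOK = 1024   # number of distinct audio tokens (assumed)
FRAME = 320       # samples per frame: 50 tokens/sec at 16 kHz (assumed)
DIM = 64

encoder = torch.nn.Conv1d(1, DIM, kernel_size=FRAME, stride=FRAME)  # downsampler
codebook = torch.randn(CODEBOOK, DIM)    # learned jointly in a real codec

def tokenize(wav: torch.Tensor) -> torch.Tensor:
    """wav: (batch, samples) float waveform -> (batch, frames) integer tokens."""
    feats = encoder(wav.unsqueeze(1)).transpose(1, 2)        # (batch, frames, DIM)
    dists = torch.cdist(feats, codebook.unsqueeze(0).expand(feats.shape[0], -1, -1))
    return dists.argmin(dim=-1)           # index of the nearest codebook vector

wav = torch.randn(1, 16000)               # one second of fake 16 kHz audio
print(tokenize(wav).shape)                # torch.Size([1, 50])
```

The language model from the previous sketch is then trained on exactly these integer sequences, and a paired decoder maps generated tokens back to a waveform.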
That makes sense. In our monthly recap last month, the data wars were one of the hot topics; you saw the New York Times lawsuit against OpenAI. There are obviously large language models in production, but there aren't large music models in production yet, so there's maybe been less of a fight there, so to speak. How do you think about that? There's a lot of copyright-free, royalty-free music out there. Is there a power law, where the best music is much better to train on? Or does it not really matter for music, because some of the musical structure is kind of the same?

I don't think we know these things nearly as well as they're known in text. We have some notions of the scaling laws here, but we're just so far behind. What I will say is that people are always surprised to learn that we don't only train on music. I usually give the analogy of the code generation models: take something like Code Llama, which is, as far as I know, the best open-source code-generating model (you guys would know better than I would, but it's certainly up there), and it's trained on a bunch of English, not only code. That's because there are patterns in English that are going to be useful, so you can imagine you don't only want to train on music to get good music models. For example, one of the places where we are particularly bad is capturing really realistic vocals, and you might imagine there are other types of human vocals, not music, that you can put into your model to help it learn. So again, it's super early. I think we've barely scratched the surface of the right ways to do this, which is really cool from a progress perspective: there's a lot of low-hanging fruit for us still to pick.
And once you get the final model, I'd love to learn more about the size of these models. People are confused that Stable Diffusion is so small (this thing can generate any image, how is it possible that it's a couple of gigabytes?) while the large language models are huge even though there's just text in them. What's it like for music, is it in between? And given the scaling you mentioned, is this something people could easily run locally?

Our models are still pretty small, certainly by tech standards. I confess I don't know the state of the art on how diffusion models scale, but our models scale similarly to text Transformers: bigger is usually better. Audio has a couple of weird quirks, though. We care a lot about how many tokens per second we can generate, because we need to stream you music as fast as you can listen to it. That's a big one that probably means we never get to a 175-billion-parameter model, if I'm being honest; maybe I'm wrong there, but I think it would be technologically difficult. The other thing is that so much progress is happening in shrinking models down for the same performance in text that I'm hopeful a lot of our issues will get solved, and we'll figure out how to do better things with smaller, or relatively smaller, models.

The ability to add performance with scale is a blessing and a curse. It's a very straightforward way to make your models better: you just make a bigger model and dump more compute into it. But it's also a crutch you will always lean on, and you'll forget to do the basic research that makes your stuff better. Honestly, early on, when we were doing things with small models for time and compute constraints, we ended up having to learn a lot to make models better, things we might not have learned if we had immediately jumped to a really big model. So we always try to skew smaller to the extent possible.
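The streaming constraint he mentions above is easy to make concrete with back-of-envelope arithmetic; all numbers below are illustrative assumptions, not Suno's.

```python
# To stream, the model must emit tokens at least as fast as the listener
# consumes them (real-time factor >= 1.0). Illustrative numbers only.
CODEC_TOKENS_PER_SEC = 50    # tokens per second of audio (assumed codec rate)
DECODE_TOKENS_PER_SEC = 80   # sustained decode speed of the model (assumed)

rtf = DECODE_TOKENS_PER_SEC / CODEC_TOKENS_PER_SEC
print(f"real-time factor: {rtf:.2f}x")    # 1.60x: fast enough to stream

# Decode speed falls roughly in proportion to parameter count, so a model
# 10x larger would drop to ~0.16x here and stall playback mid-song.
print(f"10x larger model: {rtf / 10:.2f}x")
```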
Gotcha. I'm curious about your overall evolution so far. Something I think we missed in the introduction is why you ended up choosing the music domain in the first place. You have this pretty scientific physics-and-finance background; how did you wander over to music? A lot of us have an interest in music, but we don't necessarily choose to work in it. You did.

It's funny, I have a really fun job as a result. All the co-founders of Suno worked at Kensho together, and we were doing mostly text (in fact all text), until we did one audio project: speech recognition for very financially focused speech. The long and short of it is that we fell in love with audio, not necessarily music, just audio and AI. We all happen to be musicians and audiophiles and music lovers, but it was the combination of audio and AI that we initially really fell in love with. It's so cool, it's so interesting, it's so human, and it's so far behind images and text that there's so much more to do.

Honestly, when we started the company a lot of people told us to focus on speech. If we wanted to build an audio company, everyone said, speech is a bigger market. But there's something about music that's so human that it almost couldn't prevent us from doing it; we just couldn't keep ourselves from building music models and playing with them, because it was so much fun. That's what steered us there. In fact, the first thing we ever put out was a speech model: Bark, this open-source text-to-speech model, and it got a lot of stars on GitHub.
That was people telling us even more: go do speech. And we still almost couldn't help ourselves from doing music. So maybe it's a little serendipitous, but we haven't really looked back since. I don't think there was necessarily an aha moment; it was just organic, and obvious to us that we wanted to make a music company.

So do you regard yourself as a music company? Because as of last month you were still releasing speech models: Parakeet.

Oh yes, that's right. That's a really awesome collaboration with our friends at Nvidia. We are really, really focused on music; I think that's the stuff that will really change things for the better. Honestly, everybody is so focused on LLMs, for good reason, and on the information processing and intelligence there, and I think it's way too easy to forget that there's this whole other side of things that makes people feel. Maybe that market is smaller, but it makes people feel, and it makes us really happy, so we do it. That doesn't mean we can't do things that are related, that are in our wheelhouse, and that will improve things. Like I said, audio is just so far behind; there's so much more to do in the domain more generally. So that was a really fun collaboration.
I did hear about Suno first through Bark. What did Bark lean on? Obviously there was a lot of preceding TTS work in open source; how much of that did you use, and how much was brand new from your research? What's the intellectual lineage there, just to cover the speech recognition side?

So it's not speech recognition, it's text-to-speech. As far as I know, there was no other text-to-speech, certainly not in open source, that was Transformer-based. Everything else was what I would call the old style of doing things, where you build single-purpose models that are really good at one narrow task, and you're always data-limited: the availability of high-quality training data for text-to-speech is limited. I don't think we were necessarily all that inventive in saying: let's train, in a self-supervised way, a Transformer-based model on lots of audio, and then tweak it so we can do text-to-speech on top of that. That's the new way of doing things; "foundation model" is the buzzword, if you will. So we built that up, I think, from scratch. A lot of shoutouts have to go to lots of different things, papers obviously, but there's also a big shoutout due to Andrej Karpathy's nanoGPT. There's a lot of code borrowed from there; we are huge fans of that project.
It shows people they don't have to be afraid of GPT-type things; it's actually not all that much code to make performant Transformer-based models. Again, the thing we brought was how to turn audio into tokens, and then we could take everything else from the open source. So we put that model out, and we were pleasantly surprised by the reception from the community. It got a good number of GitHub stars, and people really enjoyed playing with it, because it made really realistic-sounding audio. And this is, again, the thing about doing things the quote-unquote right way: if you have a model where you've had to put in so much implicit bias for one very narrow task, making speech that sounds like words, you're going to sacrifice other things, and in the text-to-speech case that's how natural the speech sounds. It was almost difficult to pull unnatural-sounding speech out of Bark, because it was self-supervised, trained on a lot of natural-sounding speech. That definitely told us this was probably the right way to keep doing audio.

Even in Bark you had the beginnings of music generation; you could just put a music note in there.

That's right. And it was so cool to see, on our Discord, people trying to pull music out of a text-to-speech model. What did that tell us? It told us that people are hungry to make music.
It's almost obvious in hindsight how wired humans are to make music: if you've ever seen a little kid sing before they know how to speak, you know this is really human nature. And there are actually a lot of cultural forces that cue you not to think of making music; that's what we're trying to undo.

To dive into Suno itself: when you go from text to speech, people think, okay, now I have to write the lyrics to a whole song, and that's quite hard to do. Whereas in Suno you have this empty box, very Midjourney- or DALL-E-like, where you can just express the vibes of what you want it to be; but then you also have a custom mode where you can set your own lyrics, your own rhythm, the title of the song. How do you see users distribute themselves? I'm guessing a lot of people use the easy mode; are you seeing a lot of power users in the custom mode? And what are some of your favorite use cases you've seen so far on Suno?

Actually, more than half of the usage is that expert mode, and people really like to get into it and start tweaking things: adding things, playing with words or line breaks or different ad-libs. People really love it; it's really fun. So there are two modes you can access now: one is the single box where you just describe something, and the other is the expert mode.
Those fit nicely into two use cases. The first is what we call shitposting: something funny happened, and I'm just going to very quickly make a song about it. The example I usually give is: I walk into Starbucks with one of my co-founders; he gives his name, Martin; his coffee comes out with the name "Margu". In five seconds I can make a song about this, and it has immortalized it. That Margu song is stuck in all of our heads now, and it's funny and light, and there's a levity you've brought to that moment. The other use case is getting sucked in: there's this song in my head and I need to get it out, and I'm going to keep tweaking it and listening and having ideas and tweaking it until I get the song that I want. Those are very different use cases, but there's so much in between those two things that is totally untapped in how people want to experience the joys of making music, because those two experiences are both really joyful in their own special ways. We're quite certain there's a lot in the middle there.

The last thing I'll say, and it's really interesting, is that in both of those use cases the sharing dynamics around music are really interesting and totally unexplored. An interesting comparison is images: we've probably all, in the last 24 hours, taken a picture and texted it to somebody, but most people are not routinely making a little song and texting it to somebody. When you make that more accessible to people, they are going to share music in much smaller groups, maybe not with everybody, but with one person, or three people, or five people, and those dynamics are so interesting. We have ideas of where that goes, but it's about spreading joy into these little microcosms of humanity, and people really love it.
I know I made you guys a little Valentine song, right? That's not something that happens now, because it's hard to make songs for people.

We'll put that in the audio here; I also tweeted it out, if people want to look it up. How do you think about the pro market, so to speak? Lowering the barrier to entry on these things is great. When the iPad came out, music production was one of the areas people got excited about: now you have this board you can bring with you. Madlib actually produced a whole album with Freddie Gibbs entirely on an iPad; he never used a computer. How do you see these models playing into professional music production? I guess "professional music" is a funny phrase; it's all music, and if it's good, it becomes professional, right? But I'm curious to hear how you're thinking about Suno too: is there a second act of Suno that goes broader into the custom mode and makes this the central hub for music generation?
I think we intend to make many more modes of interaction with our stuff, but we are very much not focused on, quote unquote, professionals right now. That's because what we're trying to do is change how most people interact with music, not necessarily make professionals a little bit better or a little bit faster. There's nothing wrong with that; it's just not what we're focused on. When we think about what workflows the average person would want to use to make music, I don't think they're very similar to the way professional musicians make music now. If you pick a random person on the street, play them a song, and ask what they would change about it, they're not going to say "you need to split out the snare drum and make it drier." That's just not something a random person off the street is going to say. They're going to say something much more descriptive and general about the feel of the song. So I don't think we know all the workflows people are going to want to use; we're just fairly certain that the workflows that have been developed with the current set of technologies professionals use to make beautiful music are probably not what the average person wants to use. That said, there are lots of professionals we know about using our stuff, whether for inspiration or sample generation and things like that, so I don't want to say never say never. There may one day be a really interesting set of use cases we can expose to professionals, particularly around custom models trained on people's own music, or with your voice, or something like that. But when we think about broadening how most people interact with music, and getting them to be much more active participants, we think about broadening from the consumer side, not from the producer, the professional side, if that makes sense.
Is the dream here, and I don't know if it's too coarse-grained to put it this way, to be the Midjourney of music?

There are certainly some parallels, especially in what I just said about being an active participant: the joyful experience in Midjourney is the act of creating the image, not the act of consuming it, and Midjourney lets you quickly share the image with somebody. But ultimately that analogy is somewhat limiting, because there's something really special about music. Two things, actually. One is that for the average person there's a really big gap between their taste in music and their abilities in music, and that gap isn't quite there in images for most people: most people don't have innate taste in images the way people do for music. The other thing, and this is the really big one, is that music is a really social modality. If we all listen to a piece of music together, we're listening to the exact same part at the exact same time. If we all look at the picture in Alessio's background, we'll each look at it for two seconds: I'll look at the top left, where it says "Thor"; Alessio will look at the bottom right, or something like that. It's not really synchronous. But when we all listen to a piece of music together, it's minutes long and we're hearing the same part at the same time. And if you go to the act of making music, it's even more synchronous: the most joyful way to make music is with people. So I think there's so much more to come there that would ultimately be very hard to do in images.
We've gone almost 30 minutes without making any music on this podcast, so maybe we can fix that and jump into a Suno demo.

Yeah, let's make some. We've got a new model that we're putting the finishing touches on, so I can play with it on our dev server; we've just piped it in here, and as you can see, we've been doing tons of stuff. So tell me what kind of song you guys want to make.

Let's do a country song about the lack of GPUs in my cloud provider.

Here's where we're tempted to think about pipelines and latency. This is remarkably fast; I was shocked when I saw this.

Oh my God. [Music] "...my cloud... ready to... but there ain't no GPUs, just empty space... I've been waiting all day for that raw power, but my cloud's gone dry... no GPUs to be found..." [Music] I actually don't think this one's amazing; I'll go to the next one. But it's funny that it knows about CUDA. [Music] "Well, I signed up for a cloud provider, I'd find all the power that I could, but when I searched for the GPUs I just got a surprise: you see, they're all sold out, ain't no GPUs to find. No GPUs in the cloud, it's real bad news, I need power but there ain't no use, I'm stuck with my CPU, it's a real sad plight..." What else should we make? All right, Sean, you're up.
I do want to make some observations about this, but okay: maybe house music, electronic dance.

Yeah, house music.

And then maybe we can make it about, I don't know, podcasting about music and AI music generation? I'm sure all the demos you get are very meta.

There's a lot of stuff that's meta, yeah, for sure.

I noticed, for example, that the second song you played had the word "upbeat" inserted into it. I assume there's some kind of random generator of modifier terms that you can throw on to increase the specificity of what's being generated?

Definitely. And let's try to tweak one, too: I'll play this, and then maybe we'll tweak it with different modifiers. [Music] "...through the... frequencies... you... control... never..." [Music] Here's what I want to do: that didn't drop at the right time, right? So maybe let's do this (I don't know if you guys can see this), and let's get rid of the word "male".

Is that a special token? You have a beat drop token?

Yeah.

Nice. I'm just reading it out because people might not be able to see it.

And then let's emphasize... actually, let's emphasize "house" a little more; maybe it'll be a little more aggressive. Let's try this again.

It's interesting, the prompt engineering that you have to invent.

We've learned so much from people using the models, not from us.

Are these training artifacts?

No, I don't think so. I think this is people being inventive with how you want to talk to a model.
[Music] "...sharing the peace, spreading the word, a revolution... frequencies... from the beat drop to the... about music..." [Music]

Nice. It's interesting: when you generate a song it generates the lyrics, but if you switch the music under it, the lyrics stay the same, and sometimes that feels off. I mostly listen to hip-hop, and if you change the beat you can't really use the same rhyme scheme, you know?

Definitely, though it's a sliding scale, because we could probably do this one as a country rock song; that would be my guess. But for hip-hop that is definitely true. Actually, for these models we think about three important axes. We think about the sound fidelity: does it sound like a separately recorded piece of audio? We think about the song quality: is this an interesting song that gets stuck in my head? And we think about the controllability: how well does it respond to my prompts? One of the ways we'll test these things is to take the same lyrics and try them in different styles, to see how well that really works.
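That evaluation recipe (hold the lyrics fixed, sweep the style) is easy to sketch. `generate_song` below is a hypothetical stand-in for whatever generation API you have, not Suno's actual interface; the three axes are scored by human listeners.

```python
# Hypothetical harness for the fixed-lyrics, varied-style controllability test.
LYRICS = "No GPUs in the cloud, it's real bad news..."
STYLES = ["country", "house", "country rock", "blues"]
AXES = ["sound fidelity", "song quality", "controllability"]

def generate_song(lyrics: str, style: str) -> str:
    """Hypothetical stand-in: returns a path to a generated audio clip."""
    return f"/tmp/{style.replace(' ', '_')}.wav"

renditions = {style: generate_song(LYRICS, style) for style in STYLES}
for style, clip in renditions.items():
    # A listener scores each clip on the three axes; comparing across styles
    # shows whether the model followed the prompt without losing the song.
    print(f"{clip}: rate {AXES} for style={style!r}")
```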
So let's see. I don't know what a beat drop is going to do for country rock, so I probably should have taken that out, but let's see what happens. [Music] "...there's sound spinning around, through the air... loud, sharing the beats, spreading the word, a revolution of frequencies, haven't you heard... now let the music take control, we're on a journey..." [Music]

I'm going to read too much into this, but I'd say I hear a little bit of something electronic-music-inspired there, and that's probably because "beat drop" is something you really only associate with electronic music. Maybe that's reading too much into it. Should we do one more?

Yes: something about Apple Vision Pro. Though I guess there's some amount of world knowledge you don't have, right? Whatever is on the language-model side of the equation isn't going to have the Apple Vision Pro in there.

Yeah, but let's see. How about: a blues song about a sad AI wearing an Apple Vision Pro?

Gotta be sad.

Do you have RAG for music?

No, that would be problematic, also. [Music] "...a broken heart, where my Apple Vision can't see the stars... I used to feel joy, I used to feel pain, and now I'm just a soul trapped inside this metal frame... oh, I'm singing..." [Applause] "...can't you see... life... what it used to be... searching for... soul, baby..." [Music]

I want to remix that one, and I don't want a really good voice; I want, like, I don't know... Chicago. What is it, Chicago blues guitar?

He knows too much; he's the best prompt engineer out here.

It'll be funny to have musicologists play with this and see what they would do.
How embarrassing, can I not do that? Oh, the word "Chicago" was a trigger. We try to be very careful about not letting you impersonate, and it is possible. That's embarrassing. So let's do "Midwestern". [Laughter] [Music] "...I'm a... with a broken heart, where my Apple Vision Pro can't see the stars... I used to feel joy, I used to feel pain, but now I'm just a soul trapped inside this metal frame... oh, I'm singing... oh, can't you see this life... what it used to be... I'm searching for love, I can't find a soul... oh, you help me let my spirit..." [Music]

So yeah, a lot of control there. Maybe I'll make one more: very, very soulful... and I really want a good house track.

Why is "house" the word you have to repeat?

I just really want to make sure it's house. You actually can't repeat it too many times; the hypothesis gets a little too out of domain. [Music] "...a broken heart, where my Apple Vision Pro can't see the stars... I used to feel joy, I used to feel pain, now I'm just a soul inside this metal frame... oh, I'm singing... oh, can't you see what it used to be... searching for love but can't find... soul..." [Music]

Nice. So yeah, we have a lot of fun.

Definitely. I'm really curious to see how people are going to use this to resample old songs into new styles. That's one of my favorite things about hip-hop.
A Tribe Called Quest had the Lou Reed "Walk on the Wild Side" sample on "Can I Kick It?"; Kanye sampled Nina Simone on "Blood on the Leaves". It's a lot of production work to actually take an old song and make it fit a new beat, and I feel like this can really help. Do you see people putting in existing songs' lyrics and trying to regenerate them in a new style?

We actually don't let you do that, because if you're taking someone else's lyrics, you don't own them; you don't have the publishing rights, so you can't remake that song. In the future I think we'll figure out how to let people do that in a legal way, but we are really focused on letting people make new and original music. There's a lot of music AI which is artist A doing the song of artist B in a new style (let me have Metallica doing "Come Together" by the Beatles, or something like that), and this stuff is very viral, but I really don't think it's how people want to interact with music in the future. To me it feels a lot like making a Shakespeare sonnet the first time you saw ChatGPT: you made another one, and then another one, and then you thought, this is getting old. That doesn't mean GPT isn't amazing; GPT is amazing, it's just not for that. I feel the same way here: the way people want to use music in the future is not just remaking songs in different people's voices. You lose the connection to the original artist, and you lose the connection to the new artist, because they didn't really do it.
So we're very happy to just let those things be a flash in the pan and stay under the radar.

That's a good point about generated anything, really. Recently T-Pain did an album of covers; he did a "War Pigs" people really liked, and a "Tennessee Whiskey" you maybe wouldn't expect from T-Pain, but people liked it. But I agree: you need to be a certain type of artist for covers to really be entertaining. This is great. What else is next for Suno? People first saw Bark, and then there was a big music generation push when you did an announcement a couple of months ago; I think I saw you 300 times on my Twitter timeline on the same day, it was going everywhere. What's coming up? What are you most excited about in the space, and what are some of the most interesting underexplored ideas you maybe haven't worked on yet?
Gosh, there's a lot. On the model side, it's still really early innings, and there's so much low-hanging fruit for us to pick to make these models much better: much more controllable, much better music, much better audio fidelity. There's so much that we know about, and so much that, again, we can borrow from the open-source Transformers community, that should make them better across the board. On the product side, we're super focused on the experiences we can bring to people, and it's so much more than just text-to-music. I'll say this nicely: I'm a machine learning person, but machine learning people are stupid sometimes, and we can only think about models that take X and turn it into Y. That's just not how the average human being thinks about interacting with music. What we're most excited about is all the new ways we can get people much more actively participating in music. That's making music not only with text, but maybe with other ways of doing things; that's making music together. If you want to be reductive and think about this as a video game, this is multiplayer mode, and it is the most fun you can have with music.
And honestly, it's timely right now. I don't know if you guys have seen that UMG and TikTok are butting heads a little bit, and UMG has pulled its catalog. The way we think about this: maybe they're both right, maybe neither is right. Without taking sides, that fight is about figuring out how to divvy up the current pie in the fairest way. What we are super focused on is making that pie much bigger: increasing how much people are actually interested in music and participating in music. As a very broad heuristic, the gaming industry is 50 times bigger than the music industry, and it's because gaming is super active, while too much music is just passive consumption. So we have a lot of experiments we're excited to run on the different ways people might want to interact with music, beyond just streaming it while they work.

I think at minimum you guys should have a Twitch stream that's just a 24-hour radio session. Have you ever come across Twitch Plays Pokémon?

No.

It's where everyone in the Twitch chat can vote on the next action the game takes; they wired that up to a Nintendo emulator and played Pokémon the whole game through, collaboratively. It sounds like it should be pretty easy for you guys to do that, except for the chaos that may result. But that's part of the fun.
I agree 100%. One of my pet projects is: what does it mean to have a collaborative concert? Maybe there is no artist and it's just the audience, or maybe there is an artist but there's a lot of input from the audience. If you were going to do that today, you would either need an audience full of musicians, or you would need an artist who can really interpret the verbal or non-verbal cues an audience is giving. But if you can give everybody the means to better articulate the sounds in their heads toward the rest of the audience, which is what generative AI basically lets you do, you open up way more interesting ways of having these experiences. The collaborative concert is one of the things I'm most excited about. I don't think it's coming tomorrow, but we have a lot of ideas about what it can look like.

I feel like one stage before the collaborative concert is turning Suno into a continuous experience rather than a start-and-stop motion, if that makes sense. As someone with a casual interest in DJing: when do we see Suno DJs that can continuously segue into the next song, and the next, and the next?

I think soon. And then maybe you can turn it collaborative.

You think so?

I think so.

Okay, maybe that's part of your roadmap.
You teased your V3 model a little bit. I'm wondering how you incorporate user feedback. You have the classic thumbs-up and thumbs-down buttons, but there are so many dimensions to the music. I didn't get into it, but some of the voices sounded more metallic, and sometimes that's on purpose and sometimes not; sometimes there are kind of weird pauses in there. I could go in and annotate it if I really cared about it, but I'm just listening, so I don't.

There's a lot of opportunity there; we are only scratching the surface of figuring out how to do stuff like that. The thumbs up and thumbs down, and other things like sharing and telemetry on plays, are all things we should be able to leverage in the future to make things amazing. And I imagine a future where you can have your own model with your own preferences. That's so cool, because you have control over it and you can teach it the way you want to. The analogy I'd draw is a music producer working with an artist, giving feedback, except now it's a self-contained experience where you have an artist who is infinitely flexible, who can respond to the weird feedback you might give it. We don't have that yet; everybody's playing with the same model. But there's no technological reason why it can't happen in the future.
We had a few more from random community tweets. Are there any favorite fans of Suno that you have? DHH, notorious tweeter and crowd-inflamer I guess, tweeted about you guys; I saw 3LAU is an investor; and I think Karpathy also tweeted something.

"Return to Monke," yeah.

Right. Is there a story behind that?

No, he just made that song and it speaks to him, and I think this is exactly the thing we're trying to tap into. You can think of it as a super, super micro genre of one: one person who just really liked that song, made it, and shared it. It does not speak to you the same way it speaks to him, but it really spoke to him, and I think that's so beautiful. That's something you're never going to have an artist be able to do for you, and now you can do it for yourself; it's just a different form of experiencing music. I think that's such a lovely use case.

Any fun fan mail you've gotten from musicians, or a funny story from this year?

We get a lot, and it's primarily positive. On the whole, I would say people realize they are not experiencing music in all the ways that are possible, and it does bring them joy. I'll tell you something that is really heartwarming: we're fairly popular in the blind and vision-impaired community, and that makes us feel really good.
Very roughly, and without trying to speak for an entire community: you have lots of people who are really into things like Midjourney, and they get a lot of benefit and joy, and sometimes even therapy, out of making images, and that is something that is not really accessible to this fairly large community. I don't think the analogy to Midjourney is perfect, but what we've provided is a very similar sonic experience, and it speaks to this community: a community with the best ears, the most exacting, the most attuned. So yeah, that definitely makes us feel warm and fuzzy inside.
Excellent. It sounds like there's a lot of exciting stuff on your roadmap; I'm very much looking forward to the infinite DJ mode, because then I can just play it while I work. I'd love to get your overall takes, zooming out from Suno itself, on the music generation landscape. What should people know? You've obviously spent a lot more time on this than others; you shouted out VALL-E and the other sort of Google-type work in your README for Bark. What should people know about what Google is doing, what Meta is doing? Meta released Seamless recently, and Audiobox. How do you classify the world of audio generation?

In the broader research community, people largely break things down into three big categories: music, speech, and sound effects. There's some crossover, but that's largely how people think about it. The old style of doing things still exists, single-purpose models built to do one very specific thing, alongside the new foundation-model approach. I don't know how much longer that will last; I don't have tremendous visibility into what happens in the big industrial research labs before they publish.
Specifically for music, there are a few big categories we see. There's license-free stock music: how do I get background music for the B-roll footage of my YouTube video, or for a full feature production, whatever it is; there are a bunch of companies in that space. There are a lot of AI covers: how do I cover different existing songs with AI. That's a space that is particularly fraught with legal issues, and we also just don't think it's necessarily the future of music. There are net-new songs, a new way to create net-new music, and that's the corner we like to focus on. And the last category is geared much more toward professional musicians: AI tools for music production. Many of these will look like plugins for your favorite DAW; some will look like the greatest stem splitter the market has ever seen (the current state-of-the-art stem splitters are all AI-based). That's a market with a tremendous amount of room to grow. Somebody told me this recently: if you actually think about it, music has recently evolved toward things that are sonically interesting at a very local level, and away from interesting chord changes. When you think about it that way, that is something AI can definitely help with: making a lot of weird sounds. And this is nothing new; at some point there was the theremin, and people put up an antenna and tried to do just this. I think this is a very natural extension of it. So that's how we see it, at least: there's a corner we think is particularly fulfilling, particularly underserved, and particularly interesting, and that's the one we play in.
Awesome. This was great; I know we covered a lot of things. Before we wrap: you wrote a blog post about Goodhart's law and its impact on ML, which is the idea that once you measure something, the thing you measure stops being a good metric, because people optimize for it. Any thoughts on how that applies to LLMs and benchmarks, and the world we're in today?

I think it's maybe even more apropos than when I originally wrote it, because we see so much noise about, pick your favorite benchmark, this model doing slightly better than that model, when at the end of the day there's actually no real-world difference between them, and it's really difficult to define what "real world" means. To a certain extent it's good to have objective benchmarks and quantitative metrics, but at the end of the day you need some acknowledgement that you're not going to be able to capture everything. At Suno, to the extent that we have corporate values (we're too small to have them written down), something we say a lot is "aesthetics matter": the quantitative benchmarks are never going to be the be-all and end-all of everything you care about. And as flawed as these benchmarks are in text, they're way worse in audio.
"Aesthetics matter" is basically the statement that, at the end of the day, what we're trying to do is bring people music that makes them feel a certain way, and effectively the only good judge of that is your ears. You have to listen to it. It's a good idea to try to make better objective benchmarks, but you really have to not fall prey to them.
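A toy simulation of the failure mode he's describing: hill-climb on a proxy benchmark that is only partially aligned with the true quality you care about, and the proxy keeps improving while true quality degrades. The functions and numbers are invented purely for illustration.

```python
import random

random.seed(0)

def true_quality(x):      # what listeners actually care about; peaks at x = 1
    return -(x - 1.0) ** 2

def proxy_metric(x):      # a benchmark with an exploitable term; peaks at x = 2
    return -(x - 1.0) ** 2 + 2.0 * x

x = 0.0
for _ in range(200):      # optimize the proxy, as a leaderboard incentivizes
    candidate = x + random.uniform(-0.2, 0.3)
    if proxy_metric(candidate) > proxy_metric(x):
        x = candidate

print(f"x={x:.2f} proxy={proxy_metric(x):.2f} true={true_quality(x):.2f}")
# The optimizer pushes x past 1.0, where true quality peaks and starts falling.
```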
Another pet peeve of mine: I've always said economists make, or do make, really good machine learning engineers, because they're able to think about things like Goodhart's law and natural experiments, which people with machine learning backgrounds, or physics backgrounds like me, often forget to do. At Kensho we actually used to go to big econ conferences sometimes to recruit, and those were some of the best hires we ever made.

Interesting. Is that because there's a little bit of social science in the human feedback?

I think it's not only the human feedback. In general, you have these giant, really powerful models that are so prone to overfitting, so poorly understood, and so easy to steer in one direction or another, not only via human feedback, that the ability to think about these problems from first principles, instead of getting down into the weeds of only the math, and to think intuitively about them, is really, really important. I'll give you one of my favorite examples; it's a little old at this point. If you remember SQuAD and SQuAD 2.0, the Stanford Question Answering Dataset: on the SQuAD 1 benchmark, machine learning models eventually started doing as well as a human can, and it was like, uh oh, now what do we do? And it took somebody very clever to say: well, let's think about this for a second. What if we presented the machine with questions that have no answer in the passage? That immediately opens a massive gap between the human and the machine. It's first-principles thinking like that which comes very naturally to social scientists and not as naturally to people like me. That's why I like to hang out with people like that.
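The SQuAD 2.0 trick he describes is simple to show: add questions whose answer is not in the passage, and count a prediction correct only when the system abstains. This is a toy scorer with made-up examples; the real benchmark uses exact-match and token-level F1.

```python
# Toy version of the SQuAD 2.0 idea: "" as gold means the question is
# unanswerable, so the model must abstain (predict nothing) to get credit.
examples = [
    {"question": "Who is the guest?", "gold": "Mikey Shulman"},           # answerable
    {"question": "What is the model's size?", "gold": ""},                # not in passage
]

def correct(prediction: str, gold: str) -> bool:
    return prediction.strip() == gold   # abstention = the empty string

always_guesses = ["Mikey Shulman", "175 billion"]   # a system that never abstains
score = sum(correct(p, ex["gold"]) for p, ex in zip(always_guesses, examples))
print(f"{score}/{len(examples)}")   # 1/2: guessing opens a gap vs. careful humans
```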
Well, I'm sure you get plenty of that in Boston. And as an econ major myself, it's very gratifying to hear that we have a perspective to contribute.

Oh, big time. I try to talk to economists as much as I can.

Excellent. Awesome, guys, I think this was great. We got live music, we got a discussion of generative models, we got the whole nine yards. Thank you so much for coming on.

I had great fun. Thank you guys. [Music]

Comments

@JoelMorton

Suno should be looking into gamification, or integrating a social media platform to allow public user profiles, songwriting contests/challenges/leaderboards, workshops/tutorials once V3 gets out of testing. Such potential for a community based around celebrating musical creativity.