
How to deploy real-time analytics in 10 minutes with DoubleCloud's Managed Kafka

In this video we discuss the features, challenges, and benefits of using Kafka in different deployment scenarios, comparing self-managed installations with proprietary solutions offered by vendors like DoubleCloud. It covers aspects such as scalability, monitoring, infrastructure management, Kafka Connect capabilities, and integration with other tools like ClickHouse for real-time analytics. The video also highlights the advantages of DoubleCloud's managed Kafka service, including enhanced monitoring capabilities, out-of-the-box integrations, cost efficiency, and support for different CPU types.

Index of topics:
- Introduction to Kafka's capabilities
- Challenges in managing Kafka, especially in large installations
- Comparison between self-managed Kafka and proprietary solutions
- Benefits and drawbacks of self-managed installations
- Benefits and drawbacks of proprietary solutions
- Introduction to DoubleCloud's managed Kafka service
- Enhanced monitoring capabilities of DoubleCloud's service
- Out-of-the-box integrations offered by DoubleCloud
- Support for different CPU types in DoubleCloud's service
- Developer experience with DoubleCloud's Kafka service
- Use cases and customer stories
- Discussion on observability and monitoring
- Advantages of using DoubleCloud's platform for observability
- Description of Kafka and ClickHouse integration in DoubleCloud's platform
- Terraform integration for managing infrastructure
- Q&A session discussing Terraform capabilities
- Differentiation of DoubleCloud's service from other providers
- Invitation to sign up for a free trial and conclusion

🆓 Start building your data infrastructure today with our free trial: https://auth.double.cloud/s/signup
📥 Have a question or feedback? Drop us a comment below or reach out through https://double.cloud/#contact-us-form

🔍 DoubleCloud services mentioned in the video:
Managed Service for Apache Kafka® 👉 https://double.cloud/services/managed-kafka/
Managed Service for ClickHouse® 👉 https://double.cloud/services/managed-clickhouse/
No-code ELT tool: Data Transfer 👉 https://double.cloud/services/doublecloud-transfer/
Data Visualization with AI-insights 👉 https://double.cloud/services/doublecloud-visualization/

🔗 Website: https://double.cloud/
📱 Follow us on Social Media:
LinkedIn: https://www.linkedin.com/company/doublecloudplatform/
Facebook: https://www.facebook.com/GetDoubleCloud/
Twitter: https://twitter.com/getdoublecloud
Medium: https://medium.com/doublecloud-insights
Reddit: https://www.reddit.com/user/GetDoubleCloud/

Remember to subscribe for more DoubleCloud insights and access to big-data discussions with experts. 💡

DoubleCloud


Okay, so once again, thanks for joining another webinar. Today we'll be talking more about Kafka, specifically real-time analytics. So let's jump into the webinar.

In today's fast-paced digital world there is this need for real-time streaming. Be it social media, IoT, healthcare, or finance, the data that's constantly flowing in needs to be validated, and there needs to be the ability to monitor or react to these events as they happen in real time. Take all the social media click streams you generate: in the back end the company is doing analytics based on your likes, your dislikes, what you're sharing, and what the persona of that specific user is. In finance it could be fraud detection or something like that. But we'll talk more about that in the coming slides.

Before that, a little bit about ourselves. Myself, Deepan, I'm a senior product manager here at DoubleCloud, responsible for the products and features catering to our customers.

And I am Andrei. I'm an engineering lead, responsible for our Transfer capabilities and also the common infrastructure and Terraform. I have a lot of experience with Kafka, so that's why I'm sitting here: to explain why Kafka is a really good tool for real-time analytics and how to manage it properly.

Yeah, and last but not least, DoubleCloud. DoubleCloud is a platform on which you can build an end-to-end data analytics solution using proven open-source technologies. We'll talk more about the DoubleCloud architecture and what it does in the upcoming slides. So let's proceed.
Real-time streaming, so real-time streaming is the base. There are different scenarios, like I mentioned, such as social media. It has become very crucial for businesses to have all this information available at their fingertips: making purchases online, or monitoring IoT devices. Hey, this sensor has reported something, the values coming from the sensor look wrong, what is happening, how can we take some action? All these things happen, and real-time data streaming is the force behind it. It creates a very seamless experience and it's a game-changer for business, because it allows optimizing things on the fly and making decisions on the fly. You can add some points here, Andrei.

Yeah, I also want to add some points here. The real-time part is usually crucial, because real-time analytics is not just about analytics but about calculation and business logic; it's not the classical dashboard stuff. It's more about the ability to react to certain events and the ability to automate certain things. Usually "real-time analytics" is a buzzword, but if you need to react in a matter of seconds, then you have real-time: from the moment the actual event appears somewhere in the system to the actual reaction. If that time span is about seconds, then it's real-time.
Yeah, I think speaking of this speed and the timing that's required, that's where our powerful engine, or powerful framework, comes in: Kafka. So let's explore the reasons behind Kafka. Is Kafka good for real-time data? I would say yes.

Kafka has been widely recognized as an excellent choice for real-time analytics, because Kafka right now is more or less the industry standard for anything related to real-time, and it's effectively the industry standard for building a message bus. Back in the day, twenty years ago, we had different message buses, and we still have different message buses today, but Kafka is the default one. If you're talking about a message bus or a message queue, you're usually referring to Kafka.

Kafka is nice here mostly because of its maturity, and this maturity comes with a very stable core messaging product and a lot of integration capabilities. Because this product has been on the market for a while, there are already very interesting integrations in the open-source world, developed and maintained, and there's huge expertise out there on how to operate Kafka and how to build a nice product on top of it. For example, we have Kafka Connect for connecting Kafka with external systems, we have all the necessary libraries for writing Kafka applications, and there are libraries and applications that can build charts on top of Kafka with ease.

Another thing you need to understand about Kafka is that it's a stable product. It does not evolve quickly, but on the other hand it has extremely high quality and robustness built into it; there have been no major bugs or issues in Kafka for a while. And Kafka was designed to be a large distributed system from the start, and by now it's even more mature. It allows you to build a message bus that is fault tolerant to any kind of event, from a virtual machine dying or a disk outage up to a whole region going down. Another thing Kafka is really good at is throughput. Kafka is designed internally quite nicely for throughput: it lets you write a lot of data into it and read a lot of data from it. So that's a really nice characteristic of Kafka, and more or less, if you've ever had a chance to think about real-time anything in the modern software world, you've most probably already heard about Kafka and most probably already touched some Kafka in the past.

Yeah, that's 100% right. From the technical aspect, what you mentioned about throughput, fault tolerance, and scalability, all of these are top-notch capabilities from Kafka.
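To make the write path concrete, producing events into Kafka from an application is only a few lines with the confluent-kafka Python client. This is a minimal sketch, not something shown in the webinar; the broker address, credentials, and topic name are placeholders you would replace with your cluster's values.

```python
import json
from confluent_kafka import Producer

# Placeholder connection settings; a managed cluster would also need
# security.protocol / SASL credentials from its connection page.
producer = Producer({
    "bootstrap.servers": "broker-1:9092",
    "acks": "all",  # wait for all in-sync replicas, trading latency for durability
})

def on_delivery(err, msg):
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}]@{msg.offset()}")

event = {"user_id": 42, "action": "click", "page": "/pricing"}
producer.produce(
    "clickstream-events",  # hypothetical topic name
    key=str(event["user_id"]).encode(),
    value=json.dumps(event).encode(),
    callback=on_delivery,
)
producer.poll(0)   # serve delivery callbacks
producer.flush()   # block until everything queued has been sent
```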
But let's also touch a little bit on the use cases. Okay, Kafka is good, that's fine, but what types of use cases can customers actually leverage Kafka for within their applications? Some that we've identified, and that are pretty much market standard: most customers use it for real-time dashboarding. Everything Andrei mentioned is true; you aggregate all these data sources coming in in real time, from different source connectors, from microservices, or from monitoring tools, and then put them into dashboards. It's also useful for fraud detection in the financial world, or for market analysis, where all of this data is coming in in real time.

And it's not only about business cases. I have a lot of experience using Kafka for developer-experience work, mostly for logging and monitoring inside applications. If you have a huge fleet of applications running somewhere, having Kafka for aggregating all the logs and analyzing them for outages is also a really nice thing. So real-time dashboards and reporting aren't only for the business; there's also technical stuff you can do. You can build sophisticated monitoring capabilities on top of Kafka if you aggregate all the logs from your applications into a single Kafka, for example. And this is a not-cheap but more open-source-style alternative to tools like Datadog or maybe Splunk.
Yeah, that makes sense. But speaking of all these capabilities you mentioned, there are definitely challenges, right? It seems to be a very big tool, a robust tool, but again, there are challenges with it. How can we compare the two sides? Because it's open source, anyone can just spin up a Kafka cluster and manage it on their own. And on the other hand, there are also proprietary solutions, where vendors offer different solutions or flavors of Kafka.

So yeah, Kafka is a really nice tool in terms of functionality, and actually in terms of operations as well. You can spin up a new instance of Kafka for yourself on your own virtual machine rather easily, and this easiness may lead to false confidence that operating Kafka is easy. Operating Kafka, especially a huge installation of Kafka, is never easy. You have to take care of a lot of things. First of all, the distributed nature of Kafka leads to a need for automation scripts. Sometimes Kafka needs manual or semi-manual work, especially around rebalancing, and making sure your Kafka performs well is largely about how well balanced it is. Kafka itself can have problems if data is being read in a non-optimal way, and sometimes you need to do topic work and rebalancing. This is an important piece of work.

Another thing you need to take care of with self-managed installations is monitoring and alerting, and being able to fix problems quickly. Kafka is a stable tool, but sometimes you have outages that you need to handle manually or semi-manually. The most classic one is disk outages: if you have, for example, installed your Kafka on a physical machine, a disk may die, or the machine itself may go down. This needs to be handled manually or semi-manually. You can build some automation on top of it, but that requires a lot of work. There is no well-known deployment setup that fits all needs. There are several open-source options for managing and deploying Kafka with a set of administration scripts, but they usually require some work. In my experience, I have run Kafka on thousands of nodes, and that usually requires a full team of engineers babysitting such a huge installation.

How was your experience? Did you have to work on weekends to manage it?

Well, we had on-call shifts, and those shifts sometimes fell on weekends. And since this Kafka was an essential piece of the organization, the message bus for all events, alerts could come at night. So you have to wake up at night, try to figure out what went wrong, and then realize that 23 disks across hundreds of machines have died and you need to replace them and do something about it, and that is actually painful.

So all that activity you were doing was also manual, right? Like in terms of scaling or auto-balancing?
Yes. If you need to scale up, you need to add brokers and rebalance your cluster. It's also semi-manual work; you can automate it, but you still need an experienced engineer behind it. That said, self-managed installations usually come with some benefits. For example, you can do whatever you want with your cluster: you can install whatever extensions you want and enable whatever features you want on your Kafka.

On the other side of self-managed, you can actually buy Kafka from various vendors; there is plenty of supply of Kafka offerings in the cloud, and all of them have several pros and cons, for sure. In most cases, the Kafka provided by proprietary vendors has a subset of the features of open-source Kafka, and it is most probably a subset rather than the full feature set for security reasons: Kafka itself is not just a message bus, it's also a framework, you can run code inside it, and that code is a potential vulnerability when you're talking about a managed solution. So managed vendors usually...

And also costs. Cost is also important, right?

Yes, cost too. Managed installations usually cost more, which is reasonable, and sometimes they even have different APIs or extra layers on top of Kafka, which leads to the code you've written making your application bound not to generic Kafka but to a certain flavor of Kafka, which is a direct road to vendor lock-in. Another thing that is crucial for proprietary solutions is how to connect your existing infrastructure with Kafka, because Kafka is usually not just flying somewhere in a cloud in a separate scope; it's usually bound to your existing infrastructure. You need to write into Kafka, you need to read from Kafka, and you integrate your applications with Kafka, which leads to the need for Kafka to be well connected and well integrated with what you already have.
And yeah, that's basically the trade-off. You can either go with a self-managed solution, install it on your own infrastructure, and have the full feature set of Kafka, but it comes with a cost: you have to operate it, you need a team of engineers to look after it, and you have to build your own semi-automatic, semi-manual tooling for handling certain events. The other option is to go with a proprietary solution, a managed Kafka, and this also comes with a cost: you have a bit less flexibility, the cost is often higher (most probably higher than a bare-metal installation in terms of infrastructure, not counting management costs), the features are sometimes less rich, and the approach to integrating your existing infrastructure with Kafka is sometimes less straightforward. That's the trade-off for you.

Makes sense. I think that's where our next slide comes in: how we're going to overcome these challenges, and what DoubleCloud is offering with its managed Kafka.
So within DoubleCloud, we offer a managed Kafka service. What makes it interesting is that all the capabilities Andrei mentioned for the self-hosted path, we take care of in the managed version. We have enhanced monitoring capabilities: different integrations with external logging providers, metrics via Prometheus endpoints, and integrations with Datadog and with S3, where you may want to store your logs.

You can also take the example of Bring Your Own Cloud account, which is very useful. We've seen customers asking: okay, fine that you're managing it, but is it possible, for privacy reasons and also for governance reasons, to keep this data running within their own region or within their own cloud account? Another reason is that they want centralized billing in their cloud rather than having two different vendors to manage. That's why we have the Bring Your Own Cloud offering, where we currently support AWS and GCP, with Azure on the way.

Also the support for different CPU types, x86 and ARM, gives a real boost in terms of price and performance. It depends on the use case: if your workload is light, you can choose between the different CPU node types as well. And in terms of developer experience, Andrei, I think that's your area of expertise.
I would add that, based on my previous experience of managing huge Kafka clusters: if you have a Kafka cluster, you have to manage it somehow. Kafka itself provides some management capabilities, but having this via an API or UI is a very good point. First of all, the ease of adding new topics and new users and managing those topics is a really good thing. Managing this via the CLI or an API with bash scripts is sometimes not so straightforward and can actually lead to disasters; in my past experience we had a disaster scenario where someone ran an incorrect script for topic reassignment and reassigned all the topics in the cluster. Having this hidden behind an API or UI with defined permissions is a nice thing.

Another thing which is really essential is having all your infrastructure written down in a certain way. Right now the more or less default, go-to approach for building out your infrastructure is infrastructure as code, so you're not just clicking things on a web page to create your infrastructure, but creating a set of Terraform scripts, a Terraform application, to spin up all the necessary entities inside the cloud. Having a Terraform provider for your cluster, for your integrations, for your networks, is a really good thing. Your infrastructure becomes very manageable, which is good, and it also becomes reproducible: you can create your own dev environments, developer stands, or something like that, in a matter of minutes; just run a terraform apply on a different account.
Another thing which is really important for Kafka specifically is the Kafka Connect capabilities. Kafka Connect is a very powerful framework, but in some Kafka implementations it's either forbidden or limited. We have a pre-selected set of connectors and have polished them properly to make it easy for you to run the popular ones. For us, right now, that's MirrorMaker, for mirroring your existing Kafka into a DoubleCloud Kafka, and the S3 sink connector, which allows you to dump your Kafka topics into S3 buckets for later analysis, backups, and so on.
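For reference, this is roughly what registering an S3 sink looks like against a plain Kafka Connect REST endpoint. It's a generic sketch, not DoubleCloud's interface (the managed service configures connectors through its own UI and API), and the Connect URL, bucket, and connector class are assumptions that depend on which S3 sink connector is installed.

```python
import requests

# Hypothetical self-hosted Kafka Connect REST endpoint.
CONNECT_URL = "http://connect.internal:8083/connectors"

connector = {
    "name": "s3-backup-clickstream",
    "config": {
        # Confluent's S3 sink is one common choice; other S3 sinks use different classes/keys.
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "1",
        "topics": "clickstream-events",
        "s3.bucket.name": "example-kafka-backups",
        "s3.region": "eu-central-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",  # records per object written to S3
    },
}

resp = requests.post(CONNECT_URL, json=connector, timeout=30)
resp.raise_for_status()
print(resp.json())
```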
And another thing we have: DoubleCloud itself is not just about managing databases or clusters; it's more of a data platform. So we have out-of-the-box integration with our other tools. Right now we have comprehensive tooling for data analytics with ClickHouse. ClickHouse is a very good database for real-time analytics, and paired with Kafka they shine even more: Kafka accepts all the necessary data in flight and then seamlessly lands it into ClickHouse, and then you can analyze it in a matter of milliseconds; sub-second analytics is easy with Kafka plus ClickHouse.

Another thing I want to highlight here is that we care a lot about monitoring and observability of our customers' clusters. We provide the monitoring out of the box and pre-select the metrics that are most interesting to watch, such as topic size, the most heavily loaded topics, bytes in and messages in, and so on. Having these essential monitoring capabilities is a really nice touch, especially compared with a bare self-managed installation, where you usually don't get such things out of the box; you need to set them up, and that setup is usually a painful process.
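If you prefer to pull those numbers into your own tooling, a Prometheus-format metrics endpoint can be scraped with a few lines of Python. The endpoint URL and metric-name prefix below are placeholders; the actual values come from your cluster's metrics configuration.

```python
import requests

# Hypothetical Prometheus-format metrics endpoint exposed for a Kafka cluster.
METRICS_URL = "https://metrics.example.com/cluster-abc/metrics"

text = requests.get(METRICS_URL, timeout=10).text

# Prometheus text format: "<metric_name>{labels} <value>" per line, '#' lines are comments.
for line in text.splitlines():
    if line.startswith("#") or not line.strip():
        continue
    if line.startswith("kafka_"):  # keep only Kafka-related series (placeholder prefix)
        name_and_labels, _, value = line.rpartition(" ")
        print(f"{name_and_labels} = {value}")
```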
I remember how I set everything up for self-managed clusters, and usually the monitoring is a full-time job for some engineer; getting that monitoring set up and running, and finding out whether it's actually working correctly, is a very big deal.

Yeah, and that's actually what we offer to our customers: all of this is already there, and you can try it in a matter of minutes, especially with the power of Terraform. On top of all these capabilities, the cost is also more efficient than other Kafka clusters on the market, so you can definitely have a look at it. You can also use the pricing tool to look at and validate the specific node sizes or CPU types that you want for your scenario.

I'll highlight one thing: we support ARM processors. The ARM architecture is, in terms of cost-performance, much better than the classical architecture.
We spent a decent amount of time preparing Kafka to run on these processors and tuning it for this architecture, but once we finished, we got really nice results in terms of price-to-performance ratio, and that's really good.

Yeah, moving ahead, I think we'll touch on what Andrei mentioned about out-of-the-box integration. The DoubleCloud platform itself is really these different building blocks; it's an end-to-end platform for building your analytics faster. We have managed Kafka clusters, we have a managed ClickHouse data store, and we also have managed Airflow, so all these things put together make it one single platform where you can manage all your analytical needs, right from data source ingestion. That's where DoubleCloud Transfer also comes in: you can pull data out from different data sources and make it available for fast analytics in Kafka or ClickHouse, or for other specific scenarios.

These integration capabilities also give Kafka a nice angle for extra features. For example, you can seed your Kafka with your existing sources via Transfer, connect Kafka through Transfer with ClickHouse, and build not just real-time dashboards but also historical dashboards, and these dashboards will show you the whole picture of what happens inside the organization. Transfer can be not just the reading side of Kafka but also the writing side: you can write data via Transfer into Kafka, and you can offload certain activities, like uploading the PostgreSQL transaction log into Kafka and then analyzing it in ClickHouse or with your own code, or, for example, uploading advertisement sources such as Facebook marketing data for later analysis on a regular basis. That's also a really nice option.
Yeah, I think that's where we want to touch on our other service, ClickHouse. So why choose ClickHouse for real-time analytics? Because ClickHouse is an open-source data platform, a data warehouse actually, well known for its real-time capabilities, its ability to process queries swiftly, and its efficient storage engine. It's close to a perfect solution for managing data in real-time scenarios. And the fact that... yeah, please go ahead.

Yeah, I would add a couple of things. ClickHouse right now is a very popular open-source analytical database. It has some flaws on the ingestion side, and that's why Kafka plus ClickHouse is a perfect combination. Writing into ClickHouse is usually not so easy if you're just starting to onboard into the ClickHouse world. The first thing you'll realize is that to write into ClickHouse you need to be aware of ClickHouse's implementation details, and that's why having Kafka in front of ClickHouse is very good. Kafka with a proper connector to ClickHouse, which we already have in Transfer, can minimize the hassle of ingestion into ClickHouse. And ClickHouse lets you write queries that analyze your data in milliseconds and can handle really significant load, so you can actually do real-time end-user analytics: it can handle thousands of analytical queries per second without significant performance degradation.

I think this is where we talk about the specific scenario Andrei mentioned: Kafka plus ClickHouse is the best combo; they go hand in hand for any real-time needs, really. ClickHouse also has its built-in Kafka integration, but what we offer through Transfer is more robust, because it isolates the data ingestion process and leaves ClickHouse faster for just the analytical workload.
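For context, the built-in integration mentioned here is ClickHouse's Kafka table engine plus a materialized view. Below is a minimal, generic sketch of that pattern, executed through the clickhouse-driver Python client; hostnames, credentials, topic, and column names are placeholders, and a managed setup using Transfer would not require this DDL.

```python
from clickhouse_driver import Client

# Placeholder connection details for a ClickHouse server.
ch = Client(host="clickhouse.example.com", user="admin", password="...", secure=True)

# 1) A Kafka-engine table that consumes JSON rows from a topic.
ch.execute("""
CREATE TABLE IF NOT EXISTS clicks_queue (
    ts      DateTime,
    user_id UInt64,
    page    String
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'broker-1:9092',
         kafka_topic_list  = 'clickstream-events',
         kafka_group_name  = 'ch-clicks-reader',
         kafka_format      = 'JSONEachRow'
""")

# 2) A normal MergeTree table where the data is actually stored and queried.
ch.execute("""
CREATE TABLE IF NOT EXISTS clicks (
    ts      DateTime,
    user_id UInt64,
    page    String
) ENGINE = MergeTree ORDER BY ts
""")

# 3) A materialized view that continuously moves rows from the queue into storage.
ch.execute("""
CREATE MATERIALIZED VIEW IF NOT EXISTS clicks_mv TO clicks AS
SELECT ts, user_id, page FROM clicks_queue
""")

# Sub-second analytics over the freshly ingested stream.
rows = ch.execute(
    "SELECT page, count() FROM clicks WHERE ts > now() - INTERVAL 5 MINUTE GROUP BY page"
)
print(rows)
```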
So I would say all these blocks you see, managed Kafka, Transfer, and ClickHouse, are all managed under the DoubleCloud platform itself, offering whole end-to-end analytics.

Yeah, and this whole scenario is something you can run rather easily with Terraform: you can just take our example or write your own, put it inside your infrastructure with your VPC connection, and run, for example, an IoT analysis scenario out of the box with relative simplicity. DoubleCloud also allows you to expose a REST API, a REST proxy for Kafka, which simplifies the writing side as well, because Kafka itself is a very nice tool but the Kafka protocol can sometimes be tricky, especially if you have rather simple devices, so having the REST proxy here is a nice option too. In such a scenario you can easily build an event streaming pipeline with a data warehouse attached to it: you have applications that write into the REST proxy, the proxy puts the data into Kafka, this data flows into the data warehouse, which is ClickHouse, and then you can visualize dashboards on top of it or analyze it with your own code.
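The REST-proxy write path described above looks roughly like this from a simple device or script. The sketch assumes a Confluent-compatible Kafka REST proxy; the proxy URL, auth, and topic name are placeholders, and the exact endpoint shape depends on the proxy you expose.

```python
import requests

# Hypothetical REST proxy endpoint for the cluster, producing to the "iot-readings" topic.
PROXY_URL = "https://rest-proxy.example.com/topics/iot-readings"

payload = {
    "records": [
        {"value": {"device_id": "sensor-17", "temperature_c": 21.4, "ts": "2024-05-01T12:00:00Z"}}
    ]
}

resp = requests.post(
    PROXY_URL,
    json=payload,
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # offsets/partitions for the produced records
```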
This also allows you to read the same topics from your own applications and build more sophisticated products on top of it, with Spark or even your own custom Python code. And there are other use cases too; Andrei mentioned observability, right? Like aggregating your logs internally. I think that's a scenario we're using ourselves as well.

Yeah, right now we use this a lot internally at DoubleCloud; the whole DoubleCloud technology stack is used for our observability platform. For example, we provide end users with logging capabilities for each cluster. This is a nice feature, but it works on top of our existing technology: the logs themselves are aggregated from the clusters using Vector, which is a nice tool for collecting logs, and then we deliver these logs into our own Apache Kafka installation. Then the Kafka topics with logs are aggregated, parsed, and delivered into ClickHouse. In ClickHouse we have tables with the logs for all our clusters, and we use this both for analysis with Grafana during our on-call shifts and for providing it to customers for their needs, so they can actually read the logs of their own Kafka installation; those logs come from that very same pipeline.

Building such a pipeline is usually not so expensive; we spent rather little money on it, but what it gives you is maximum freedom in how you use your observability data, and it's extremely scalable, because Apache Kafka can handle a big load: it can scale from several kilobytes per second of data to gigabytes per second in a matter of hours, maybe minutes, so it can be scaled up and down rather easily. ClickHouse itself can handle a significant load and is also a really good tool in terms of cost-to-performance ratio. We actually use this stack as a replacement for our old Datadog-based observability stack, and that gave us a huge cost optimization, because Datadog is usually extremely expensive, while ClickHouse is a very nice tool for this and is also quite popular in observability.

With this kind of setup you can already see it being around 3x less expensive than the whole ELK stack, and in some scenarios it can be 10 times less expensive or even more, because the current market tools for observability are extremely expensive. We've also seen customers actually moving away from the ELK stack to this kind of setup with ClickHouse specifically for observability, where they want to aggregate all the logs and have this capability with open source at the core. Because ClickHouse and the open-source solutions here give you a rather big variety of choices in how to tune the observability layer further: you have ClickHouse connected to Grafana if you want to build a dashboard in Grafana, and you have various libraries for logging, analysis, and monitoring, all of which were already developed by people using ClickHouse for their own needs. There are plenty of options in the open-source world for an observability stack on top of ClickHouse nowadays; it's a popular one.
Makes sense. I think we should also mention an upcoming webinar where you'll get more information about observability and monitoring, so please sign up; look on our website or scan this QR code to sign up for the webinar specifically about observability and monitoring if you're interested.

Moving ahead, this is a small customer use case, one of our customer stories. I wanted to mention LSports. For them, latency was a very key requirement, because it's a sports betting company where they wanted all this data coming in as fast as possible, with minimum latency. They went ahead with DoubleCloud: they're pulling data out of Kafka using the Transfer service and using ClickHouse to get that sub-second latency. What you saw in the previous slide is part of the blocks they're already using, Transfer and ClickHouse, to actually get the benefit out of the platform and the sub-second latency to solve the problem, which is quite cool. They were also able to get it running in a short span of time, which helps in terms of time to market from the company's standpoint: rather than building everything in house, which would take huge effort and time, relying on such a platform really created a lot of value for them.

Moving ahead, I think we're almost on time, so if there are any questions, please feel free to drop them in the chat, or you can also reach out to us through our website. If you have any use cases around real-time streaming or observability, or if you feel you need to incorporate Kafka somewhere in your organization, or you're looking at migration scenarios from your existing Kafka because it's expensive or for some other reason, please reach out to us and we'll be able to support you. Are there any questions, Andrei?
Yep, there's one about the Terraform stuff. The question is about Terraform, a general one. The interesting part is that Terraform is quite a powerful tool. It allows you to configure everything as code. You just write, for example, a Terraform configuration for a Kafka cluster, commit it to some Git repository, then apply it to your infrastructure, and you have a cluster. This is a pretty straightforward approach compared with, for example, clicking through it in your browser. But what it actually gives you is the ability to maintain it. If you want to change something, you change it in the code. And if you want to integrate it with something, you can use the Terraform resource and pick fields and properties from it.

For example, the classic use case for us: can you scroll down to the next slide, the one with observability? It's more complex. If you want to set up everything in one place, Terraform is exceptionally good here, because you need to deploy, for example, a Fluentd configuration to collect the logs. Let's imagine you have Fluentd deployed somewhere in Kubernetes and it's already there; you already have a Helm chart that applies this configuration to the existing clusters, but you need to pass the password, hostname, and username from the newly created Kafka into that Fluentd configuration. What you can actually do is have Terraform create the resource for the Kafka cluster in the cloud and use that resource's properties, like username, password, and host, to pass them into your cluster configuration. This binds your infrastructure together into one piece, and a single terraform apply, a single Terraform execution, sets up not just the cluster but the integration itself. You can also manage topics via Terraform, which is also nice: you can create a topic via Terraform and pass it to your application or your integration or something like that. This ability opens a huge variety of options for how you can use the tool and how you can make it more efficient for you.
Okay, are there any other questions? How is DoubleCloud different from other providers? I think I can take that one. So, the capabilities we mentioned here are somewhat unique to us, and the out-of-the-box integration is also crucial, because if you're going with Kafka, you're not just going with Kafka; you definitely need something else. Let's say for real-time analytics you need a fast data store like ClickHouse or something similar. In that case we offer these out-of-the-box integrations as well, plus a lot of other capabilities like the Terraform integration. Bring Your Own Cloud account is also a very commonly requested feature from our customers who want to manage it themselves. On top of that, we are also very cost efficient, and the support for the CPU types you see, x86 and ARM, also makes it a bit different and more interesting. Other than that, what we already talked about: monitoring and similar capabilities are pretty much the same across providers, but what we offer in enhanced monitoring additionally, like exporting logs to Datadog or exposing metrics via a Prometheus endpoint, all of this combined as a whole makes it more interesting. And yeah, that's how we see it being different.

I would just highlight one more thing: the out-of-the-box integration with the existing data pieces in DoubleCloud is a really good thing that opens up a lot of capabilities to enhance your Kafka setup and enhance your data pipeline as a whole.

Okay, if there are no more questions, then yeah, please feel free to get in touch with us. If there are any specific use cases around real-time streaming, Kafka, ClickHouse, real-time analytics, observability, anything, just reach out to us, we'll get back to you, and we can discuss it further. And by the way, you can also sign up for a free trial of the DoubleCloud platform, where you can just spin up a Kafka cluster or a ClickHouse cluster and play around with it. We also give free credits for the free trial, so feel free to use it, and see you at another webinar. See you guys, thank you.
