Taking messaging and data ingestion systems to the next level

July 13, 2020

In this podcast, ​Sijie Guo​ chatted with Ben Lorica on how Apache Pulsar is able to handle both queuing and streaming and both online and offline applications.

This conversation spanned many topics, including the following items:

  • The role of messaging in modern data applications and platforms.
  • The two main types of messaging applications: queuing and streaming.
  • Apache Pulsar as a unified messaging platform, able to handle both queuing and streaming, and both online and offline applications.
  • A status update on Apache Pulsar.

Below is the full transcript of this conversation.

  • Ben: ​Sijie Guo, the founder at StreamNative, welcome to The Data Exchange Podcast.
  • Sijie: ​Thank you, Ben. It's nice to have a chance to talk about StreamNative and broad messaging technology.
  • Ben: ​Today we're going to talk about streaming in general and messaging in particular, but before we jump in, let's talk about your background a bit. I know you were at Twitter, can you share a little bit of your background and experience?
  • Sijie: Yes. I was at Twitter for about six years, leading the messaging group and building and maintaining the messaging platform behind Twitter's real-time and streaming infrastructure. Before Twitter, I was part of the back-end group at Yahoo, working on push notifications. Those are the two main jobs I've had working with messaging and streaming infrastructure.
  • Ben: ​So, Sijie, the reason I wanted to have you here is because we are, of course, in a world where machine learning and AI are very much important and top of mind for many people and many companies. But these technologies rely on data, and collecting data is an important component in making sure you have good AI applications. So, at a high level, what exactly do you mean when you say messaging?
  • Sijie: ​Broadly, messaging is about the communication between services. For example, in a data pipeline, a messaging system serves the purpose of taking data from different sources into your platform so your data scientists can process it. It's serving the purpose of communicating or collecting the data from one end to the other end. So that is a broad definition.
  • Ben: ​The important thing here is that you want a system that can absorb data in real-time, maybe in high volume, and that always stays up, right?
  • Sijie: ​Yes, exactly, because your data sources would have different types of workload, and they usually arrive in large volumes. For example, click events, or impressions, account for a huge volume of traffic, and you want your system to be able to absorb the peak traffic.
  • Ben: ​These messaging systems and messaging infrastructure will surely become even more important in the world of 5G and IoT, right?
  • Sijie: ​Yes. With the rise of 5G, IoT, and a lot of mobile device sensors, IoT devices will become an individual source that will collect all the events data. You’ll need to have a powerful messaging platform to absorb all the traffic so your data science team can process the data.
  • Ben: For someone who's not technical and who just wants to have an idea of what these systems can do, at a high level, what are the two types of messaging? What is queuing and what is streaming? And what are some good examples of each?
  • Sijie: ​In terms of communication, it typically divides into two patterns. One is queuing. A simple way to think about this is the queue. For example, when you go into a bank, you wait in a queue for a banker to help you. That is like a worker queue, with task-oriented workloads that are processed on a per-event basis. Since they're processing all events, they don't really care about ordering. This messaging queuing system is common in industries like e-commerce retailers to process payments, transactions, and billing statements. That is one of the most common communication channels, taking the messages, all the events, from the end user to your system.
  • Ben: ​So, they don't care as much about the order. In other words, they just want to reconcile your account at the end of the day.
  • Sijie: ​Yes.
  • Ben: ​But then, they also cannot tolerate any mistakes.
  • Sijie:​ Right.
  • Ben: ​So, what about streaming?
  • Sijie: In terms of streaming, you get a sequence of events that you want to collect on a per-entity or per-stream basis. For example, for an IoT device that you want to use to collect one data point, like temperature, you'd have a sensor collecting all the temperature changes. You'd want to collect those data points, or events, in a particular order so you can visualize or analyze the changes in sequence. Streaming systems are focused on things like user behavior, fraud detection, and maybe event processing and log collection. They are for situations where you want to collect the events from a certain device or entity in a particular order and analyze behavior from that sequence of events.
  • Ben:​ Another example people might be familiar with is logs and web traffic, that kind of thing. It seems like, at a high level, one distinction is queuing systems are used for mission-critical transactions, and then streaming systems are for things like web traffic and logs, where you care about the order of data, but you may be able to tolerate some loss.
  • Sijie: ​Yes, exactly.
  • Ben: ​Interesting. So, let's dive into use cases for queuing and streaming. What are some classic use cases?
  • Sijie: ​For queuing, the most common use is in e-commerce companies for handling payments, transactions, and sometimes billing requests. It might also be used for task-related activities, like processing images or videos that a user uploads to your website or your cloud to ensure you collect the event and process it, and guarantee the event is not lost.
  • Ben: For people who are versed in database management lingo, this would be more like online transaction processing (OLTP).
  • Sijie: ​Exactly. Yes.
  • Ben: ​Then what about use cases for streaming?
  • Sijie: Streaming is actually happening in almost every enterprise. If you're building a web service, you have a lot of web logs to collect. If you're building a mobile app or even webpages, you have a lot of user-click events to collect. In the IoT world, you have a lot of device events or sensor events to collect. These are events that you collect and process in real time, for example to analyze user behavior for edge predictions. You analyze the user-click behavior, and then you predict which path to send to them. For fraud detection, you want to analyze the behavior of a device based on the events collected from it so you can detect fraud in advance. So analytics-oriented use cases are more attached to streaming.
  • Ben: ​So, this would include business intelligence (BI), dashboards, and maybe even feeding into online learning and machine learning, right?
  • Sijie: ​Yes.
  • Ben: I guess in the language of database management, for people who are versed in that, it would be more like online analytical processing (OLAP), whereas queuing is more like OLTP. So that raises a question: in database management, these two systems are usually different. You have a system for OLTP, and you have a system for OLAP. If you're a company that's doing both streaming and queuing, in other words, you have transactions that you care about but also some reports that you care about, then right now I would imagine you'd have to have two different systems.
  • Sijie: Yes, exactly. That's what I have observed when talking to users and customers. For online services, in the OLTP world, people typically use a traditional message queue. RabbitMQ is one of the most popular for this workload.
  • Ben: ​That's an open-source project I hear a lot about. Basically what you're saying is that people who are running these transactional services in real-time are likely to be using RabbitMQ.
  • Sijie: Yes. RabbitMQ was originally designed for financial-oriented workloads. It cares a lot about consistency and durability because in those workloads, you cannot tolerate any data loss. When it was originally born, it was a single-node system, and it evolved into a replicated system, so you can have standby nodes to achieve high availability. But as the business grows, people using RabbitMQ might run into scalability issues, because RabbitMQ is not a distributed system. It cannot tolerate peak traffic. To give you a simple example, think of the Singles' Day sales in November in China, or maybe the Black Friday sales in the US.
  • Ben: ​In China, 11/11 is called Singles' Day, right?
  • Sijie: ​11/11, yes. The traffic is huge and it peaks to very large traffic at a certain point in time. All the orders, all the purchases, all the transactions happen almost within an hour. You have to be able to handle that traffic. So, in that sense, it impacts the end user experience. If you're using RabbitMQ in those systems, you cannot really handle peak traffic. A lot of people look for approaches to build a proxy layer on top of RabbitMQ to make it more like a distributed system. The problem is very similar to how people solve the scalability issue of MySQL in the database management world.
  • Ben: Interesting. In the OLTP or the queuing world, RabbitMQ is popular, and I guess everyone who's listening to this podcast has probably at least heard of Apache Kafka. That would be the popular streaming messaging system, right?
  • Sijie: Yes. Kafka basically started from log collection and log processing, I think. It was designed to solve the problem of moving data from your web server logs to your Hadoop system, so it can be used to speed up the whole data process in the Hadoop world. The concept of moving data in a streaming way, coupled with the rise of stream processing, made Kafka very popular in workloads like log collection, log processing, and event collection. But from what I have observed, Kafka is mostly used in analytics-oriented workloads for business intelligence, so a small percentage of data loss is acceptable for most of the use cases. That's why Kafka is getting popular in that area.
  • Ben: ​At this point, what you're describing are two systems for queuing and streaming, where you're taking data from a variety of data sources and ingesting it, and in both systems that might not be the final step. In fact, that might be the first step. Because, I don't know if you've been paying attention, but there's now also a set of time series databases, right?
  • Sijie: ​Yes.
  • Ben: ​Like TimescaleDB, which is based on Postgres. So, basically you probably ingest data through Kafka, maybe do some data processing in Spark streaming, and then it lands in a time-series database, and then you start doing analytics. Just to clarify, both of these queuing and streaming systems we're talking about are just the first step in this data pipeline, right?
  • Sijie: Yes, both queuing and streaming are the first steps for getting data into your pipeline. They solve the communication problem between two systems. They don't target the same thing as something like TimescaleDB, which is more focused on how you query data based on a certain time range. Time-series databases are designed more for the query-oriented workload. Queuing and streaming are more about data movement and the ability to collect data, store data, and maybe do a bit of processing.
  • Ben: ​Which is foundational. So for a lot of companies, this is a must-have to do anything at all. If you're going to do even simple reports and BI, or even just counts and averages, you need systems like this.
  • Sijie: ​Yes.
  • Ben: ​Sijie, I first met you when you were part of a startup called Streamlio, which has now been acquired by Splunk. Streamlio was a proponent for a couple of technologies, mainly Apache Pulsar, and you have since started a startup, which is also a proponent of Apache Pulsar. First, your startup evangelized a lot in China for the Pulsar community. But now you've moved back to the Bay area to start building out cloud offerings and services for Apache Pulsar. So at a very high level, where does Pulsar fit into the picture of queuing and streaming?
  • Sijie: That is a very good question. As we just mentioned, in the messaging world, you have two different systems: for OLTP workloads, you have queuing systems like RabbitMQ, and for offline analytics-oriented workloads, you have Kafka, so the data is separated into two different systems. That means, from a business perspective, that your data science team or your data engineers cannot fully leverage all your event data in a single place, because you have to move the data from your online services to your offline or near-offline pipeline. That will slow down the whole business innovation process because you have to figure out where the data is stored. That's an issue we have observed a lot in enterprises. I would probably use Tencent as an example for our audience.
  • Ben: ​For the audience, Tencent is one of the largest companies in China. Among other things, they run a service called WeChat, which is the largest social network in China, and they run the second largest payment system in China called WeChat Pay.
  • Sijie: Tencent is probably one of the largest Pulsar adopters. They invested heavily in Pulsar and adopted it to support their billing platform. That means almost every purchase transaction that happens on Tencent products goes through Pulsar: WeChat Pay, as Ben mentioned, and also a lot of games and e-commerce products within Tencent. Tencent uses Pulsar mainly for two purposes. First, it's used for online transaction processing. Pulsar is a transaction processing engine, so every purchase goes first to a queue, then buffers in the queue while the transaction engine processes the transactions. That's the online piece. After you process the transactions, you still need to do several things. You need to collect the transactions into a stream processing engine like Flink or Storm, and you need to perform a real-time reconciliation to make sure every purchase or transaction that the engine processes is correct. Second, you also want to apply machine learning algorithms to look for fraud in the transactions from different entities and users so you can catch fraud as early as possible. As you can see, Pulsar is used both for online transaction processing and for post-transaction processing.
  • Ben: ​So, it's being used to make sure the transactions go through and are correct, but it’s also potentially used for the analytics and reporting that people need in many companies, like dashboards and BI.
  • Sijie: ​Yes, exactly.
  • Ben: ​The point you're trying to make here is that they're using one system to do both of these things, right?
  • Sijie: Yes, they are using one system. The biggest benefit is that you can keep one copy of the data, so that is the source of truth. You use the same dataset for your online transaction processing pipeline and your post-transaction pipeline for fraud detection and reporting. Having everything come from one dataset guarantees that you get an accurate view of your business. Imagine if you used two different systems, like RabbitMQ for one and Kafka for the other. That would cause a lot of data inconsistency because you'd have to move data from one system to the other. Solving data consistency problems caused by multiple systems is a really hard engineering challenge in every domain. Data inconsistency can lead to data loss, and your business team then has to spend a lot of time figuring out where the problem is coming from. Are the events lost? Are the events duplicated? How do we get the events into a consistent state?
  • Ben: ​It's compelling, and with that said, I have to say, of course, that I'm also friendly with Kafka people. It's a widely used and popular system. But I can see the case you're making of having a single system do both the queuing and the streaming, both the transaction processing and the analytics. This is a problem that a lot of companies probably have to deal with. One question to you, though, Sijie, is that it seems too good to be true or am I missing something? Because in the database world, you usually have two systems. In the data ingestion world, it seems like a single system can handle both. Does that have to do with the fact that Pulsar also had the advantage of coming much later after the two systems were already introduced? So Pulsar had the benefit of actually looking at what needed to be done to unify these two systems?
  • Sijie: Yes, there are a couple of points to be made here. The first point is, when I talk to people about a unified messaging and streaming system, some people might use the OLAP and OLTP concepts in database management to do a comparison, but the comparison is a bit different here. In the database world, the workload is more about querying, so OLAP and OLTP are serving two different types of workload. But in the data ingestion world, the messaging and streaming world, it's more about moving the data and guaranteeing the data is not lost. So both systems share a bit more similarity.
  • Ben: ​It's just that they're moving two different types of data.
  • Sijie: ​Yes.
  • Ben: ​Transactions and then maybe logs on the other hand.
  • Sijie: ​Yes, exactly. And the second point is, as you mentioned, Pulsar has the advantage of coming out after the other messaging systems in the market. So when we started Pulsar and while we iterated on the Pulsar project, it was done in the shadow of giants like Kafka and RabbitMQ. We learned the good parts of those systems and saw mistakes that those systems made, and we adopted the good parts and learned from their experience to build what is now Pulsar.
  • Ben: ​Let me ask you this: let’s say I want to adopt a new piece of open source technology. The first thing I want to know is, is it a good system? It sounds like from what you describe, Pulsar definitely is worth evaluating, given the fact that it can unify queuing and streaming, but how broad is the community? Because we're talking about two systems, systems like Spark or Hadoop or Kafka, or TensorFlow that have a lot of adoption, and their communities are confident that there are a lot of companies using these technologies. There are a lot of contributors and a vibrant ecosystem. What is the ecosystem around Pulsar right now?
  • Sijie: Pulsar was open sourced two years ago, and then it graduated. We spent a year or so in the Apache Software Foundation incubating the project, and grew the number of companies from four to maybe 20 or 30. After that, it graduated as a top-level project and we continued growing adoption. Right now, across the world, more than 100 companies have adopted Pulsar, and the number is growing. I would expect to see 300 or 500 in the next three to six months. The Pulsar adopters are uniformly distributed across the world. We see the growth of Pulsar adoption happening in three main regions: North and South America; Europe, especially France, where there are a lot of Pulsar adopters; and Asia. In Asia, we have Yahoo in Japan and China, and some companies in Korea as well.
  • Ben: ​Sijie, I don't know if you know, but I did ​a short post​ when I was still at O'Reilly analyzing the traffic of Apache Pulsar. That was pretty interesting. Closing quickly here, I have a couple of questions for you. You have a new startup called StreamNative. Very quickly, what does StreamNative do?
  • Sijie: The name StreamNative comes from cloud-native event streaming. Our vision is very clear: we want to use cloud-native event streaming technology, namely Pulsar, to solve problems in the real-time enterprise stack. We believe that Kubernetes is the foundation of these cloud-native technologies, and Pulsar is designed and implemented in a cloud-native way that integrates well with running on Kubernetes. We're building our product around this technology to help enterprises solve the real-time problem in their current stack.
  • Ben: ​And I know you've been focused for the last year or so on building the Pulsar community and encouraging adoption, and a hundred plus companies growing to a few hundred in a year is quite impressive. You're also beginning to plan more meetups and even a conference, right?
  • Sijie: Yes. We got approval from the Apache Software Foundation to host two Pulsar conferences in 2020. It's a Pulsar user conference called the Pulsar Summit. We want to bring Pulsar adopters, Pulsar developers, and maybe some industry experts around streaming and messaging together to discuss the technology, use cases, and adoption. One event will be in San Francisco in April, and the other will be in Beijing in September.
  • Ben: ​Very cool. I'm looking forward to staying in touch with you and seeing how, not just Pulsar, but how this whole data ingestion and messaging space evolves. It seems like these are core components in every data infrastructure, so any company that wants to become a machine learning company will need components along these lines that we discussed in this episode, right?
  • Sijie: ​Yes, exactly.
  • Ben: ​One last question for you, Sijie. Since you are one of these people who spends a lot of time in both China and San Francisco, how would you compare the state of messaging and data infrastructure in these two places? I'm sure China has a lot more scale, but it seems like the Bay Area has a lot more of the core contributors to these projects, right?
  • Sijie: Yes, more of the contributors are in the US, in North America. But you know, the economy in China is growing, and the number of users, the volume of data, and the number of enterprises using our product in China are huge. It's actually a very good strategy for our company to take our technology to China to find a lot of users and to verify that the technology is able to support large amounts of data and a large-scale user base. Through the whole process, finding a community fit is the main tenet of the project in the US, while we're able to use China as a larger base of community adoption to verify the technology and mature the project, so we can gather feedback and learn a lot from those users. That feedback and experience are then turned into features in the project and the product. That's how we move from a project-community fit to a project-product fit.
  • Ben: ​I know I said that was my last question, but this is the actual last question. I’d be remiss if I didn't ask: now that we've set up messaging as an important part of data infrastructure and machine learning and AI, obviously you have a startup and you must be hiring. So this is your chance to tell people what you're looking for and why they might consider joining StreamNative.
  • Sijie: ​Yes, so at StreamNative we are mainly focused on building our cloud offering around Pulsar, and we expect to launch a cloud service around the first quarter of 2020. We are looking for a lot of cloud engineers or streaming experts who are interested in this technology and want to create the next generation of cloud-native streaming products. Feel free to reach out to me, send me a DM on Twitter​, or send me an email.
  • Ben: ​Do people need to relocate to the Bay Area, or is StreamNative open to people working remotely?
  • Sijie: ​We are building a remote company. A lot of our employees are globally distributed, so you are not forced to relocate. We are happy to have employees work in different time zones.
  • Ben: ​Sijie Guo, thank you for joining us today.
  • Sijie: ​Thank you Ben for inviting me to the episode.
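
The queuing and streaming patterns described in the conversation can be illustrated with a small sketch. The Python below is a toy in-memory model, not Pulsar's API: the function names `consume_as_queue` and `consume_as_stream`, the worker names, and the sensor readings are all made up for illustration. It reads one shared list of events in two ways: as a work queue, where each event is handed to exactly one worker and ordering across workers does not matter, and as per-device streams, where each device's events stay in arrival order.

```python
from collections import defaultdict
from itertools import cycle


def consume_as_queue(messages, workers):
    """Queuing: hand each message to exactly one worker (round-robin here)."""
    assignment = defaultdict(list)
    for worker, message in zip(cycle(workers), messages):
        assignment[worker].append(message)
    return dict(assignment)


def consume_as_stream(messages, key_of):
    """Streaming: group events per entity, preserving arrival order."""
    streams = defaultdict(list)
    for message in messages:
        streams[key_of(message)].append(message)
    return dict(streams)


# One shared list of events: (device id, temperature reading).
# The values are invented sample data.
events = [
    ("sensor-a", 20.1),
    ("sensor-a", 20.7),
    ("sensor-b", 18.4),
    ("sensor-a", 21.3),
    ("sensor-b", 18.9),
]

# Queue view: work is spread over workers, regardless of which sensor sent it.
queued = consume_as_queue(events, workers=["worker-1", "worker-2"])

# Stream view: each sensor's readings are kept together, in order.
streamed = consume_as_stream(events, key_of=lambda event: event[0])

for worker, batch in queued.items():
    print(worker, batch)
for sensor, readings in streamed.items():
    print(sensor, [value for _, value in readings])
```

With the sample data, `queued` splits the five events between the two workers regardless of which sensor produced them, while `streamed` keeps each sensor's readings in sequence. Both views are derived from the same list, loosely mirroring the single-copy-of-data idea Sijie describes. A real deployment would rely on the messaging system's own subscription features rather than code like this.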

This podcast was originally published on The Data Exchange.
