I have to prepare an answer to a question for a university assignment. It's about stream processing (a topic I'm not completely clueless about), but what makes it difficult for me is that it's specifically about stream processing parallelization. This is the exact question:
A key mechanism to horizontally scale stream processing topologies is auto-parallelization, i.e., identifying regions in the data
flow that can be executed in parallel, potentially on different machines. How do key based aggregations, windows or other stateful operators affect parallelization? What challenges arise when scaling out such operators?
Does anyone more experienced with stream processing have an answer?
I get that for sketchy shit like this, textbooks and the internet aren't that reliable. But why not ask your classmates, tutor, or prof?
>>331277
I don't know a thing about this, but it seems directly analogous to ILP data hazards from computer architecture, and to every single thing in parallel algorithms.
All the data your function applies to obviously has to pass through the node where it's computed. So if you can't break the operator down into simpler functions that can be parallelised and then maybe reduced on one node, you end up shoving the whole stream through a single node, which would suck.
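To make that concrete, here's a minimal Python sketch of why key-based aggregation is the usual way out: route records by a hash of the key, so each worker owns the state for a disjoint subset of keys and never has to talk to the others. All the names here are made up for illustration, this isn't any real framework's API.

```python
# Sketch: key-based partitioning of a stateful aggregation.
# Each worker holds per-key state for its own share of the key space,
# so updates are purely local and the workers can run in parallel.
from collections import defaultdict

NUM_WORKERS = 4

# one independent piece of state per worker: key -> running count
worker_state = [defaultdict(int) for _ in range(NUM_WORKERS)]

def route(key):
    # consistent key -> worker mapping; every record for a given key
    # lands on the same worker, so its state stays local
    return hash(key) % NUM_WORKERS

def process(record):
    key, value = record
    w = route(key)
    worker_state[w][key] += value   # stateful update, no cross-worker traffic

# example stream: (user_id, click_count) events
for event in [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 1)]:
    process(event)

print([dict(s) for s in worker_state])
```

The catch, and presumably what the question is fishing for: scaling out means changing route(), and then the existing per-key state (counts, window contents, etc.) has to be repartitioned or migrated to the new workers, plus anything not keyed (like a global window) still has to funnel through one place.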