a:I["6ed23a8ff0cc",[],"default",1] 9:["$","div",null,{"className":"_comments_4x64e_1","children":[["$","$La","48328136",{"id":48328136,"user":"liampulles","text":"I think the 80/20 solution for reliable workflows is:

- Ensure the workflow is idempotent - if it stops or fails at any point, you should be able to start it from scratch and skip / happily redo various elements.

- Store the messages which trigger workflows.

- Track failures (if your log aggregation is good, even that's enough to start).

Then when the odd thing fails (or sometimes a bunch of things fail, because e.g. a core integration goes down) you can look up the messages and have a little script or tool to go and re-queue them. This is an easy starting point that can keep you going for a long time until you really approach huge scale.","date":1780083278000,"comments":[],"commentsCount":0}],["$","$La","48326450",{"id":48326450,"user":"pkaler","text":"I found the accompanying blog post excellent. In my experience, systems go from a monolith to a distributed monolith to a reliable distributed system. A durable workflow engine is one of the pieces that is required to get to the target state.

https://hatchet.run/blog/durable-execution","date":1780075993000,"comments":[{"id":48328174,"user":"dang","text":"Discussed a while back, for anyone interested:

How to think about durable execution - https://news.ycombinator.com/item?id=46245238 - Dec 2025 (37 comments)","date":1780083460000,"comments":[],"commentsCount":0}],"commentsCount":0}]]}]