We have a system using 10,000 processes. Due to a bug, some rogue code occasionally calls a function with some garbage data. This eventually causes total system failure. We can not easily load code onto the system since it sits at a customer site.
Using the trace bifs, we can instruct the VM to set a watchpoint on the called function, a watchpoint that only triggers when the function is called with the bad value, and upon triggering inspects the call stack and tells us which piece of code issued the bad call.
Unfortunately, the interface to the trace bifs is geared towards power, and not is not particularly user friendly. A layer on top of the bifs is needed.
Try this for a laugh (it will kill the node).
5> dbg:tracer(),dbg:p(all,m).Another problem with dbg is that it does not lift the level of abstraction enough (from that of the trace bifs).
Things look grim, what to do?
redbug aims to be a safe and simple interface to the erlang trace functionality, with some extra features thrown in.
The simplicity comes from exposing only the most useful (i.e. the bits I use often) parts of the API in a concise way.
The safety comes partly from the simplicity, but also from the way the redbug process works.
redbug supervises itself and will kill itself if it deems that it is in danger of overloading the node. It also delays the processing of the trace messages until after the trace is done.
The most interesting extra feature is probably the stack match spec. Answers the eternal question: "Who the * is calling my function with this * argument?"
Another potentially interesting feature is remote tracing; i.e. tracing on a node other than the one we run redbug on.
redbug:start(Trc,Opts) Returns: {Procs,Funcs} | {argument_error,Err} | {syntax_error,Err} Procs: number of matched processes Funcs: number of matched functions Err: error descriptor If the blocking option is given, start/2 will return a list of trace messages.
Trc: list('send'|'receive'|string(RTP)) RTP: restricted trace pattern the RTP has the form: "<mfa> when <guards> -> <actions>" where <mfa> can be; "mod", "mod:fun", "mod:fun/3" or "mod:fun('_',atom,X)" <guard> is something like; "X==1" or "is_atom(A)" and <action> is; "return" and/or "stack" (separated by ";") E.g. ets:lookup(T,hostname) when is_integer(T) ->stack Opts: list({Opt,Val}) general opts: time (15000) stop trace after this many ms msgs (10) stop trace after this many msgs target (node()) node to trace on arity (false) print arity instead of arg list blocking (false) block start/2, return a list of messages procs (all) (list of) Erlang process(es) all|pid()|atom(RegName)|{pid,I2,I3} print-related opts max_queue (5000) fail if internal queue gets this long max_msg_size (50000) fail if seeing a msg this big buffered (no) buffer messages till end of trace print_calls (true) print calls print_file (standard_io) print to this file print_msec (false) print milliseconds on timestamps print_depth (999999) formatting depth for "~P" print_re ("") print only strings that match this trc file related opts file (none) use a trc file based on this name file_size (1) size of each trc file file_count (8) number of trc files
6> redbug:start("ets",{time,1000}). {38,136} 17:43:55 <{erlang,apply,2}> {ets,lookup, [ac_tab,{env,stdlib,shell_history_length}]}Trace on ets:lookup, timeout after 1 second.
7> redbug:start("ets:lookup",{time,1000}). {38,1} 17:43:55 <{erlang,apply,2}> {ets,lookup, [ac_tab,{env,stdlib,shell_history_length}]}Trace on ets:lookup/2, match on part of the second argument, timeout after 1 second.
8> redbug:start("ets:lookup(_,{_,_,shell_saved_results})",{time,1000}). {38,1} 17:45:27 <{erlang,apply,2}> {ets,lookup, [ac_tab,{env,stdlib,shell_saved_results}]}Trace on calls to ets:lookup and its return values, timeout after 1 second, max two messages.
9> redbug:start("ets:lookup->return",[{msgs,2},{time,1000}]). {38,1} 18:06:26 <{erlang,apply,2}> {ets,lookup, [ac_tab,{env,stdlib,shell_history_length}]} 18:06:26 <{erlang,apply,2}> {ets,lookup,2} -> []Trace on ets:lookup/2, match first argument, guard on the second argument, timeout after 1 second, max two messages.
10> redbug:start("ets:lookup(ac_tab,X)when not is_tuple(X)->return",[{msgs,2},{time,10000}]). {38,1} 18:09:27 <{erlang,apply,2}> {ets,lookup,[ac_tab,[]]} 18:09:27 <{erlang,apply,2}> {ets,lookup,2} -> []Trace on calls to ets:lookup and its return values, timeout after 1 second, max two messages. Display call stack.
11> redbug:start("ets:lookup->stack,return",[{msgs,2},{time,1000}]). {38,1} 18:05:02 <{erlang,apply,2}> {ets,lookup, [ac_tab,{env,stdlib,shell_history_length}]} shell:eval_loop/3 shell:eval_exprs/7 shell:exprs/7 shell:result_will_be_saved/0 shell:get_history_and_results/0 shell:get_env/2 18:05:02 <{erlang,apply,2}> {ets,lookup,2} -> []
redbug has the idiotic/awesome feature that it can be run from a Unix shell. actually pretty handy if you're working against an embedded system.
bash$ ./priv/bin/redbug foo@localhost "ets:lookup" 22:38:52 <{erlang,apply,2}> {ets,lookup, [ac_tab,{env,stdlib,shell_history_length}]} redbug done, timeout - 7