小孩晚上睡觉发梦癫什么原因| 梦见去扫墓是什么预兆| 主动脉钙化什么意思| 葳蕤是什么中药| 阴囊两侧瘙痒是什么原因| 子午相冲是什么意思| 用盐洗脸有什么好处| 女人右手断掌代表什么| 蒙奇奇是什么动物| 上水是什么意思| 驿是什么意思| 窝里横是什么意思| 科目三考什么内容| 梦见自己生病住院了是什么意思| 头晕没精神是什么原因| asics是什么牌子| 出生日期查五行缺什么| 形同陌路是什么意思| 大人是什么意思| 过敏是什么意思| 元参别名叫什么| 核桃壳有什么用处| 石家庄为什么叫国际庄| 糖尿病什么水果不能吃| 什么是扁平足| 109是什么意思| 女人男相有什么说法| 梦见人头是什么征兆| 卡尔文克莱恩是什么牌子| 12月9号是什么星座| 一个虫一个离念什么| 生长发育科检查些什么| 例假少吃什么药| 盗汗挂什么科| 湉是什么意思| 十月二十七是什么星座| 三个句号代表什么意思| 耳鸣用什么滴耳液| 猫咪踩奶是什么意思| ich是什么意思| 雄性激素是什么| 小孩口臭是什么原因| 百无一用是什么意思| 什么的河水| hrd阳性是什么意思| 18是什么生肖| 4月28号是什么星座| 脸上长红色的痘痘是什么原因| epa是什么| 村姑是什么意思| 手串19颗代表什么意思| 侵犯什么意思| 肝硬化是什么| 软著是什么| 七字五行属什么| 猫什么时候绝育| 出单是什么意思| 口腔痛什么原因| 引火归元是什么意思| 喝什么可以解酒| 肝病看什么科室| tao是什么意思| 碧螺春属于什么茶| 结节钙化是什么意思| 什么颜色对眼睛有好处| 婴儿掉头发是什么原因| 什么是讨好型人格| 中耳炎用什么药| 什么肉不含嘌呤| 不什么思什么| 耳钉后面的塞子叫什么| 雨花茶是什么茶| 1958年属什么生肖| 5月11日什么星座| 三竖一横念什么| 省公安厅副厅长是什么级别| 长期胃胀是什么原因| 九天揽月是什么意思| 势利眼是什么意思| 肝着是什么意思| c919是什么意思| 沉不住气什么意思| 粽子叶是什么植物的叶子| ppl是什么意思| 毕业送老师什么礼物好| 米诺地尔搽剂和米诺地尔酊有什么区别| 吃什么有助于降血压| 鸭子炖汤和什么一起炖最有营养| 梦到被狗咬是什么意思| 打嗝不停吃什么药| 花甲吃什么| 风湿免疫科是什么病| 类风湿有什么特效药| mac是什么意思啊| 打乙肝疫苗挂什么科| 胆碱能性荨麻疹吃什么药| 一个口一个犬读什么| 黄皮果是什么水果| 节瓜煲汤放什么材料| 唇红是什么原因| 颜值爆表是什么意思| 吕布是什么生肖| 245是什么意思| 61岁属什么生肖| 985高校是什么意思| 为什么会胎停| 邮编什么意思| 皮肤出现红点是什么原因| hhh是什么意思| 睡觉起来口苦是什么原因| 火钳刘明什么意思| gjb2基因杂合突变是什么意思| 车前草有什么功效和作用| 孔子名什么| 同字五行属什么| 双肺门不大是什么意思| 耳道湿疹用什么药| 尿毒症可以吃什么水果| 为什么贫穷| 复方木尼孜其颗粒治什么病| 六月飞雪是什么意思| 胃胀痛什么原因| 双子座是什么时候| 杞人忧天是什么意思| 痘痘里面挤出来的白色东西是什么| 手指抽筋是什么原因| 黑裤子配什么颜色上衣| 长痘不能吃什么| 血压高压高低压正常是什么原因| 河南话信球是什么意思| 3月23日什么星座| 鸭肉不能和什么一起吃| 何炅和谢娜是什么关系| 6月份怀孕预产期是什么时候| 电瓶车什么牌子好| gbs是什么意思| 加湿器用什么水比较好| 羊水为什么叫羊水| 女性支原体阳性是什么意思| 内服什么可以美白全身| 甲沟炎是什么样子的| 一直想大便是什么原因| 胃肠感冒发烧吃什么药| 暖五行属什么| 大云是什么| 布谷鸟叫有什么征兆| 藕粉色是什么颜色| 1999属什么| 破财消灾什么意思| 办理生育登记有什么用| 咳嗽有黄痰吃什么药| 夏天脚冷是什么原因| 镇党委副书记是什么级别| 长期便秘吃什么药好| 感冒打喷嚏吃什么药| 经常打嗝放屁是什么原因| 江西庐山产什么茶| 分心念什么| 紧急避孕药叫什么名字| 动爻是什么意思| 太阳鱼吃什么食物| 孕妇吃什么水果好| 瘦人吃什么长胖| 百分位是什么意思| 后脑勺长白头发是什么原因| 姓姜的男孩起什么名字好| samedelman是什么牌子| 糖尿病患者适合吃什么水果| 血常规异常是什么意思| 女生男相的是什么命| 除皱针什么牌子效果最好| 脉细滑是什么意思| 炼奶是什么| 煜字五行属什么| maggie什么意思| 三什么五什么| 肾囊肿有什么症状| 牙齿什么颜色最健康| 吃什么治疗便秘| 王不见王是什么意思| 背部长痘痘是什么原因造成| 男人吃韭菜有什么好处| 复方石韦胶囊治什么病| 属虎的生什么属相的宝宝好| 肌肉痛是什么原因| 早上口苦是什么原因| 男性硬下疳是什么样子| 余事勿取什么意思| 什么风什么面| 宝宝病毒性感冒吃什么药效果好| 金字旁的字有什么| 利尿是什么意思| 脂肪肝适合吃什么水果| 梦见老人去世预示什么| 石龙子吃什么| 带状疱疹是什么样的| 天蝎女和什么座最配| 武则天墓为什么不敢挖| 肌肉的作用是什么| 6月25日是世界什么日| 苦瓜有什么营养| 理性什么意思| 假体隆胸什么材料好| 唾液粘稠是什么原因| 医学P代表什么| 右边脸颊长痘是什么原因| 度化是什么意思| 速度是70迈心情是自由自在什么歌| 指甲开裂是什么原因| 门牙旁边的牙齿叫什么| 董酒是什么香型| 狗狗中毒了用什么办法可以解毒| 嗓子沙哑是什么原因| 怼怼是什么意思| 舒俱来是什么宝石| 染色体xy代表什么| 什么的竹叶| 水瓶座的幸运色是什么| 墨菲定律是什么| 喝紫苏水有什么功效| 西洋参是补什么的| fast什么意思| 外聘是什么意思| 憨笑是什么意思| 藏红花适合什么样的人喝| 专项变应原筛查是什么| 读书有什么好处| 吃小龙虾不能和什么一起吃| 28度穿什么衣服合适| 仓鼠咬笼子是什么原因| 儿童喝蜂蜜水有什么好处和坏处| 月经一直不停有什么办法止血| 单抗是什么药| 眼睛模糊吃什么药| 山竹和什么不能一起吃| 什么是什么意思| 上不下大是什么字| 左氧氟沙星有什么副作用| 仙人跳是什么意思| 一醉方休什么意思| 黑乎乎的什么| 黄芪什么人不能喝| 头发变黄是什么原因| 皮肤瘙痒症用什么药| 鸡尾酒是什么意思| 谷氨酰转移酶高是什么病| 上火流鼻血吃什么降火| 升血压吃什么药| 眼睛老跳是什么原因| 什么怎么什么造句| 锲而不舍下一句是什么| tdi是什么意思| 心什么如什么| 西安古代叫什么名字| 独家记忆是什么意思| 12月2号什么星座| 吃什么hcg翻倍快| 碳元素是什么| 过敏性鼻炎用什么药效果最好| 人为什么会发烧| 卡地亚蓝气球什么档次| xgrq是什么烟| 性格好的女生是什么样| 为什么子宫会隐隐作痛| 百度
Skip to content
Data Observability Updated Jan 28 2021

扒窃团伙流窜集市作案 邹城警方重拳出击全擒获

AUTHOR | Ryan Kearns
百度 因为他们主动把主动权放掉了,好事嘛。

In this article series, we walk through how you can create your own data observability monitors from scratch, mapping to five key pillars of data health. Part 1 of this series was adapted from Barr Moses and Ryan Kearns’ O’Reilly training, Managing Data Downtime: Applying Observability to Your Data Pipelines, the industry’s first-ever course on data observability. The associated exercises are available here, and the adapted code shown in this article is available here.

From null values and duplicate rows, to modeling errors and schema changes, data can break for many reasons. Data testing is often our first line of defense against bad data, but what happens if data breaks during its life cycle??

We call this phenomenon data downtime, and it refers to periods of time where data is missing, erroneous, or otherwise inaccurate. Data downtime prompts us to ask questions such as:

  • Is the data up to date?
  • Is the data complete?
  • Are fields within expected ranges?
  • Is the null rate higher or lower than it should be?
  • Has the schema changed?

To trigger an alert when data breaks and prevent data downtime, data teams can leverage a tried and true tactic from our friends in software engineering: monitoring and observability

We define data observability as an organization’s ability to answer these questions and assess the health of their data ecosystem. Reflecting key variables of data health, the five pillars of data observability are: 

  • Freshness: is my data up to date? Are there gaps in time where my data has not been updated?
  • Distribution: how healthy is my data at the field-level? Is my data within expected ranges?
  • Volume: is my data intake meeting expected thresholds?
  • Schema: has the formal structure of my data management system changed?
  • Lineage: if some of my data is down, what is affected upstream and downstream? How do my data sources depend on one another?

It’s one thing to talk about data observability in this conceptual way, but a complete treatment should pull back the curtain — what does data observability actually look like, under the hood, in the code?

It’s difficult to answer this question entirely, since the details will depend on one’s choice of data lake vs data warehouse, BI tools, preferred languages and frameworks, and so on. Even so, addressing these problems using lightweight tools like SQLite and Jupyter could be useful.

In this article, we walk through an example data ecosystem to create our own data quality monitors in SQL and explore what data observability looks like in practice.

Let’s take a look.

Data Observability in practice

This tutorial is based on Exercises 1 of our O’Reilly course, Managing Data Downtime. You’re welcome to try out these exercises on your own using a Jupyter Notebook and SQL. We’ll be going into more detail, including exercises 2, 3 and 4, in future articles.

Our sample data ecosystem uses mock astronomical data about habitable exoplanets. For the purpose of this exercise, I generated the dataset with Python, modeling anomalies off of real incidents I’ve come across in production environments. This dataset is entirely free to use, and the utils folder in the repository contains the code that generated the data, if you’re interested.

I’m using SQLite 3.32.3, which should make the database accessible from either the command prompt or SQL files with minimal setup. The concepts extend to really any query language, and these implementations can be extended to MySQL, Snowflake, and other database environments with minimal changes.

A database entry in EXOPLANETS contains the following info:

0._id: A UUID corresponding to the planet.
1. distance: Distance from Earth, in lightyears.
2. g: Surface gravity as a multiple of g, the gravitational force constant.
3. orbital_period: Length of a single orbital cycle in days.
4. avg_temp: Average surface temperature in degrees Kelvin.
5. date_added: The date our system discovered the planet and added it automatically to our databases.

Note that one or more of distance, g, orbital_period, and avg_temp may be NULL for a given planet as a result of missing or erroneous data.

sqlite> SELECT * FROM EXOPLANETS LIMIT 5;

Note that this exercise is retroactive — we’re looking at historical data. In a production data environment, data observability is real time and applied at each stage of the data life cycle, and thus will involve a slightly different implementation than what is done here.

For the purpose of this exercise, we’ll be building data observability algorithms for freshness and distribution, but in future articles, we’ll address the rest of our five pillars — and more.

Freshness

The first pillar of data observability we monitor for is freshness, which can give us a strong indicator of when critical data assets were last updated. If a report that is regularly updated on the hour suddenly looks very stale, this type of anomaly should give us a strong indication that something is off.

First, note the DATE_ADDED column. SQL doesn’t store metadata on when individual records are added. So, to visualize freshness in this retroactive setting, we need to track that information ourselves.

Grouping by the DATE_ADDED column can give us insight into how EXOPLANETS updates daily. For example, we can query for the number of new IDs added per day:

You can run this yourself with $ sqlite3 EXOPLANETS.db < queries/freshness/rows-added.sql in the repository. We get the following data back:

Based on this graphical representation of our dataset, it looks like EXOPLANETS consistently updates with around 100 new entries each day, though there are gaps where no data comes in for multiple days.

Recall that with freshness, we want to ask the question “is my data up to date?” — thus, knowing about those gaps in table updates is essential to understanding the reliability of our data.

Plotting rows_added vs. date_added
Freshness anomalies!

This query operationalizes freshness by introducing a metric for DAYS_SINCE_LAST_UPDATE. (Note: since this tutorial uses SQLite3, the SQL syntax for calculating time differences will be different in MySQL, Snowflake, and other environments).

The resulting table says “on date X, the most recent data in EXOPLANETS was Y days old.” This is information not explicitly available from the DATE_ADDED column in the table — but applying data observability gives us the tools to uncover it.

Plotting days_since_last_update vs. date_added

Now, we have the data we need to detect freshness anomalies. All that’s left to do is to set a threshold parameter for Y — how many days old is too many? A parameter turns a query into a detector, since it decides what counts as anomalous (read: worth alerting) and what doesn’t. (More on setting threshold parameters in a later article!).

The data returned to us represents dates where freshness incidents occurred.

On 2020–05–14, the most recent data in the table was 8 days old! Such an outage may represent a breakage in our data pipeline, and would be good to know about if we’re using this data for anything worthwhile (and if we’re using this in a production environment, chances are, we are).

Freshness detections

Note in particular the last line of the query: DAYS_SINCE_LAST_UPDATE > 1;.

Here, 1 is a model parameter — there’s nothing “correct” about this number, though changing it will impact what dates we consider to be incidents. The smaller the number, the more genuine anomalies we’ll catch (high recall), but chances are, several of these “anomalies” will not reflect real outages. The larger the number, the greater the likelihood all anomalies we catch will reflect true anomalies (high precision), but it’s possible we may miss some.

For the purpose of this example, we could change 1 to 7 and thus only catch the two worst outages on 2020–02–08 and 2020–05–14. Any choice here will reflect the particular use case and objectives, and is an important balance to strike that comes up again and again when applying data observability at scale to production environments.

Below, we leverage the same freshness detector, but with DAYS_SINCE_LAST_UPDATE > 3; serving as the threshold. Two of the smaller outages now go undetected.

Image for post
Note the two undetected outages — these must be fewer than 3-day gaps.

Now, we visualize the same freshness detector, but with DAYS_SINCE_LAST_UPDATE > 7; now serving as the threshold. All but the two largest outages now go undetected.

Image for post

Just like planets, optimal model parameters sit in a “Goldilocks Zone” or “sweet spot” between values considered too low and too high. These data observability concepts (and more!) will be discussed in a later article.

Distribution

Next, we want to assess the field-level, distributional health of our data. Distribution tells us all of the expected values of our data, as well as how frequently each value occurs. One of the simplest questions is, “how often is my data NULL”? In many cases, some level of incomplete data is acceptable — but if a 10% null rate turns into 90%, we’ll want to know.

This query returns a lot of data! What’s going on?

The general formula CAST(SUM(CASE WHEN SOME_METRIC IS NULL THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*), when grouped by the DATE_ADDED column, is telling us the rate of NULL values for SOME_METRIC in the daily batches of new data in EXOPLANETS. It’s hard to get a sense by looking at the raw output, but a visual can help illuminate this anomaly:

The visuals make it clear that there are null rate “spike” events we should be detecting. Let’s focus on just the last metric, AVG_TEMP, for now. We can detect null spikes most basically with a simple threshold:

As detection algorithms go, this approach is something of a blunt instrument. Sometimes, patterns in our data will be simple enough for a threshold like this to do the trick. In other cases, though, data will be noisy or have other complications, like seasonality, requiring us to change our approach.

Image for post

For example, detecting 2020–06–02, 2020–06–03, and 2020–06–04 seems redundant. We can filter out dates that occur immediately after other alerts:

Note that in both of these queries, the key parameter is 0.9. We’re effectively saying: “any null rate higher than 90% is a problem, and I need to know about it.”

Image for post

In this instance, we can (and should) be a bit more intelligent by applying the concept of rolling average with a more intelligent parameter:

Image for post

One clarification: notice that on line 28, we filter using the quantity AVG_TEMP_NULL_RATE — TWO_WEEK_ROLLING_AVG. In other instances, we might want to take the ABS() of this error quantity, but not here — the reason being that a NULL rate “spike” is much more alarming if it represents an increase from the previous average. It may not be worthwhile to monitor whenever NULLs abruptly decrease in frequency, while the value in detecting a NULL rate increase is clear.

There are, of course, increasingly sophisticated metrics for anomaly detection like Z-scores and autoregressive modeling that are out of scope here. This tutorial just provides the basic scaffolding for field-health monitoring in SQL; I hope it can give you ideas for your own data!

What’s next?

This brief tutorial intends to show that “data observability” is not as mystical as the name suggests, and with a holistic approach to understanding your data health, you can ensure high data trust and reliability at every stage of your pipeline.

In fact, the core principles of data observability are achievable using plain SQL “detectors,” provided some key information like record timestamps and historical table metadata are kept. It’s also worth noting that key ML-powered parameter tuning is mandatory for end-to-end data observability systems that grow with your production environment.

Stay tuned for future articles in this series that focus on monitoring anomalies in distribution and schema, the role of lineage and metadata in data observability, and how to monitor these pillars together at scale to achieve more reliable data.

Until then — here’s wishing you no data downtime!

Interested in learning more about how to apply data observability at scale? Reach out to Ryan and book a time to speak with us using the form below.

Our promise: we will show you the product.

出什么什么什么 绅士是什么意思 戊五行属什么 貌不惊人什么意思 政治庇护是什么意思
脸上爱出油是什么原因 木节念什么 为什么晚上睡觉老是做梦 什么三什么四 什么叫hpv
拿什么拯救你我的爱人演员表 什么是奇数什么是偶数 疤痕憩室什么意思 脱疽是什么意思 中药一般什么时候喝最好
秋天有什么花开 婴儿吃什么奶粉好吸收 隔离的作用是什么 碳酸氢钠有什么作用 levi是什么意思
最大的沙漠是什么沙漠hcv8jop2ns4r.cn 汕头有什么好玩的地方cl108k.com 鼻窦炎是什么hcv7jop9ns3r.cn 多囊卵巢综合症是什么hcv8jop8ns9r.cn 鱼油有什么副作用wuhaiwuya.com
什么是diyhcv9jop5ns6r.cn 夏天用什么带饭不馊hcv9jop5ns8r.cn 癌症晚期吃什么食物好hcv7jop6ns5r.cn 吃避孕药有什么危害luyiluode.com 过敏性鼻炎吃什么食物好hcv8jop6ns2r.cn
王毅是什么级别hcv9jop2ns8r.cn 血小板偏高有什么危害hcv8jop6ns4r.cn 枫树叶子像什么hcv9jop6ns3r.cn 脚后跟痛是什么问题hcv9jop6ns8r.cn 遇难呈祥是什么生肖hcv9jop6ns4r.cn
肆意洒脱 什么意思hcv9jop2ns6r.cn 扒是什么意思hcv8jop8ns8r.cn 水对什么hcv9jop6ns6r.cn 血小板分布宽度低是什么原因hcv8jop5ns5r.cn aqua是什么牌子wzqsfys.com
百度