{"id":145,"date":"2024-07-20T17:23:39","date_gmt":"2024-07-20T20:23:39","guid":{"rendered":"https:\/\/alice.com.br\/tech\/sem-categoria\/how-we-extract-data-for-our-dw\/"},"modified":"2026-02-19T15:43:50","modified_gmt":"2026-02-19T18:43:50","slug":"how-we-extract-data-for-our-dw","status":"publish","type":"post","link":"https:\/\/alice.com.br\/tech\/how-we-extract-data-for-our-dw\/","title":{"rendered":"How we extract data for our DW"},"content":{"rendered":"<h4>A long long time\u00a0ago\u2026<\/h4>\n<p>About a year ago I arrived at Alice with the compromise to make the world healthier. We had less than 100 employees and living from series A, which means short money. A big challenge and a lot of work to\u00a0do!<\/p>\n<h4>The first challenges<\/h4>\n<p>As a data engineer, my responsibility was to provide data for everybody. The team was very small (actually just me and half the time of my boss) and the systems were changing very\u00a0fast.<\/p>\n<p>Facing the situation, the plan needed to be as simple and direct as possible:<\/p>\n<ul>\n<li>Use a relational database as the data warehouse. We choose PostgreSQL hosted in AWS RDS. Easy to configure and maintain.<\/li>\n<li>Use a simple tool to schedule the ETL pipelines. The choice was Airflow since it&#8217;s a very stable, well know tool and we had experience with it (and we love\u00a0Python).<\/li>\n<li>A visualization interface. Metabase was the winner for many motives: Free, previous experience with it, easy to install and configure, self-service, very stable and users usually like\u00a0it.<\/li>\n<\/ul>\n<p>OK\u2026. That&#8217;s something we can make happen! I reused some previous code I wrote in the past to copy data from one place to another and data started to flow! Great! Mission accomplished!<\/p>\n<h4>Problems always\u00a0happen!<\/h4>\n<p>BUT\u2026. almost every day I had dags broken and had to add a table, fix some table structure, re-run dags, and deploy emergency PRs. The problem was for every new column, every modification in size or type, even the order of some selects could break something. How to solve it?<\/p>\n<h4>For every problem, a solution!<\/h4>\n<p>At that point, I started to work on an Airflow plugin for a smarter extractor which was baptized, given my total lack of imagination, &#8220;SmartTransfer&#8221;. The main idea was to check the existence and structure of the table in the destination stage area and automatically create or re-create as\u00a0needed.<\/p>\n<p>The intention was to be really easy to be called, with minimal parameters. Below is a real example of a very simple dag with\u00a0it:<\/p>\n<p><iframe loading=\"lazy\" src=\"\" width=\"0\" height=\"0\" frameborder=\"0\" scrolling=\"no\"><a href=\"https:\/\/medium.com\/media\/7faf4916b61074e1a251a48b55577003\/href\">https:\/\/medium.com\/media\/7faf4916b61074e1a251a48b55577003\/href<\/a><\/iframe><\/p>\n<p><em>(all code is available at <\/em><a href=\"https:\/\/github.com\/alice-health\/airflow_plugins\">https:\/\/github.com\/alice-health\/airflow_plugins<\/a> <em>)<\/em><\/p>\n<p>So, how does it work? No big secret, but there are some steps in the process.<\/p>\n<p>First, we need to check the structure of the source table. We run a query and get one row from the source table and analyze the\u00a0columns.<\/p>\n<p><iframe loading=\"lazy\" src=\"\" width=\"0\" height=\"0\" frameborder=\"0\" scrolling=\"no\"><a href=\"https:\/\/medium.com\/media\/aa1ad187358be8b92f52a8ba5fd8d6c5\/href\">https:\/\/medium.com\/media\/aa1ad187358be8b92f52a8ba5fd8d6c5\/href<\/a><\/iframe><\/p>\n<p>From the table, we can create build the &#8216;create table&#8217; statement.<\/p>\n<p><iframe loading=\"lazy\" src=\"\" width=\"0\" height=\"0\" frameborder=\"0\" scrolling=\"no\"><a href=\"https:\/\/medium.com\/media\/612364fbe3b882fca211d453c0fa7964\/href\">https:\/\/medium.com\/media\/612364fbe3b882fca211d453c0fa7964\/href<\/a><\/iframe><\/p>\n<p>We used the &#8216;unlogged table&#8217; PostgreSQL feature because it should be, in principle, faster. Unlogged tables are faster but not crash-free, so be careful! Since staging is just a copy of production, it&#8217;s not a problem to lose data. <a href=\"https:\/\/www.postgresql.org\/docs\/11\/sql-createtable.html\">You can learn more about PostgreSQL here<\/a>. We didn&#8217;t notice much difference but we have only a few gigabytes. We will benchmark again in the future as our database grows.<\/p>\n<p>Everything set, the logic is quite straightforward:<\/p>\n<ul>\n<li>If the destination table doesn&#8217;t exist, create\u00a0it<\/li>\n<li>If the destination table exists but the structure is different, drop it and recreate\u00a0it<\/li>\n<li>If tables have the same structure, same primary keys, just keep the\u00a0table<\/li>\n<\/ul>\n<p>It&#8217;s important to remember in the case of (re)creating the destination table we need to do a full import, so we set the flag &#8220;need_full_import&#8221; to\u00a0True<\/p>\n<p><iframe loading=\"lazy\" src=\"\" width=\"0\" height=\"0\" frameborder=\"0\" scrolling=\"no\"><a href=\"https:\/\/medium.com\/media\/2bfb41b76b83a85411afcbaf739329d5\/href\">https:\/\/medium.com\/media\/2bfb41b76b83a85411afcbaf739329d5\/href<\/a><\/iframe><\/p>\n<p>Finally, transfer the data. The first step is to know if we are going to do a full import or an incremental import:<\/p>\n<p><iframe loading=\"lazy\" src=\"\" width=\"0\" height=\"0\" frameborder=\"0\" scrolling=\"no\"><a href=\"https:\/\/medium.com\/media\/aaa4f0613d2923850568eed527c52622\/href\">https:\/\/medium.com\/media\/aaa4f0613d2923850568eed527c52622\/href<\/a><\/iframe><\/p>\n<p>and prepare the insert statements. The syntax is PostgreSQL specific, with &#8220;<em>ON CONFLICT DO UPDATE&#8221;<\/em> to create a mixed insert\/update statement:<\/p>\n<p><iframe loading=\"lazy\" src=\"\" width=\"0\" height=\"0\" frameborder=\"0\" scrolling=\"no\"><a href=\"https:\/\/medium.com\/media\/538289ba2b8b9bcd3fb07634be08bca1\/href\">https:\/\/medium.com\/media\/538289ba2b8b9bcd3fb07634be08bca1\/href<\/a><\/iframe><\/p>\n<p>The main loop makes the data transfer. The most important part here is the grouper function: It lazily consumes the source_cursor, keeping in memory only the data will be sent in the current\u00a0block<\/p>\n<p><iframe loading=\"lazy\" src=\"\" width=\"0\" height=\"0\" frameborder=\"0\" scrolling=\"no\"><a href=\"https:\/\/medium.com\/media\/aca4c1f9af8fe1c23f37b2fd81b3a271\/href\">https:\/\/medium.com\/media\/aca4c1f9af8fe1c23f37b2fd81b3a271\/href<\/a><\/iframe><iframe loading=\"lazy\" src=\"\" width=\"0\" height=\"0\" frameborder=\"0\" scrolling=\"no\"><a href=\"https:\/\/medium.com\/media\/e52079bf806612b757bf4497fb89a7b4\/href\">https:\/\/medium.com\/media\/e52079bf806612b757bf4497fb89a7b4\/href<\/a><\/iframe><\/p>\n<p>More details of installing, using, and testing SmartTransfer are in the README file in the official repository at <a href=\"https:\/\/github.com\/alice-health\/airflow_plugins\">https:\/\/github.com\/alice-health\/airflow_plugins<\/a>.<\/p>\n<h4>Conclusion<\/h4>\n<p>I tried to cover the most important (and interesting) parts of the code and show how we solved the problem of moving, not hundred of terabytes with 100 people, but a couple of gigabytes with two people, in a way to automatize all table synchronization and copying process to minimize the data engineer manual\u00a0work.<\/p>\n","protected":false},"excerpt":{"rendered":"About a year ago I arrived at Alice with the compromise to make the world healthier. We had less than 100 employees and living from series A, which means short money. A big challen","protected":false},"author":3,"featured_media":184,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[4],"tags":[],"class_list":["post-145","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-software"],"acf":[],"_links":{"self":[{"href":"https:\/\/alice.com.br\/tech\/wp-json\/wp\/v2\/posts\/145","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/alice.com.br\/tech\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/alice.com.br\/tech\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/alice.com.br\/tech\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/alice.com.br\/tech\/wp-json\/wp\/v2\/comments?post=145"}],"version-history":[{"count":5,"href":"https:\/\/alice.com.br\/tech\/wp-json\/wp\/v2\/posts\/145\/revisions"}],"predecessor-version":[{"id":265,"href":"https:\/\/alice.com.br\/tech\/wp-json\/wp\/v2\/posts\/145\/revisions\/265"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/alice.com.br\/tech\/wp-json\/wp\/v2\/media\/184"}],"wp:attachment":[{"href":"https:\/\/alice.com.br\/tech\/wp-json\/wp\/v2\/media?parent=145"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/alice.com.br\/tech\/wp-json\/wp\/v2\/categories?post=145"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/alice.com.br\/tech\/wp-json\/wp\/v2\/tags?post=145"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}