Data Engineering
- Help Branching Out / Upgrading Skillset
I've been a data (reporting) analyst for nearly a decade. I got my start writing complex SQL queries to develop reports and expanded into visualization (Tableau, Power BI). Lately I spent time as a data modeler designing DBs and helping create ETLs using ADF (though not putting hands on that process myself, I'm familiar enough with it to make it work).
I'm looking for my next opportunity and finding that my lack of knowledge in Python is creating a blocker. Also, other skills I don't have seem to crop up, like Spark, Hadoop, etc. It seems like the Data Engineer role has been folded into my Data Analyst role and I just can't compete any more.
Would anyone have any suggestions for paths I might take to remedy this? I can work with some online Python courses but I feel I'm not getting the full experience needed to really support the new needs being asked of those in my role. I'm hoping someone might have some suggestions for directions I might take to up-skill myself and be more prepared for the emerging needs of this changing industry.
- The 6 columns essential to a $6B/year database table
When you have a database with 6 billion US dollars flowing through it every year, you need to be able to account for and prove every single penny and every single action that occurred. So you'd better have _A tables for all of the main tables and have these columns to boot.
- Create_user_id – Who/what created the record
- Create_Dt – When exactly was this record created
- Update_user_id – If updated, who updated this record (default null)
- Update_Dt – When was it last updated (default null)
- Archive_Dt – When can we legally destroy these records
- Unique_Trans_id – So that tracing down everything that occurred becomes even easier.
It isn't sexy but it'll be handy if you ever need to trace down things in your database too.
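As a rough illustration of the six columns above, here is a minimal sketch using Python's built-in sqlite3 module. The `payment` table, its `amount_cents` column and the `batch_loader` user are hypothetical; only the six audit columns come from the post, and the 2999-12-31 default is just one possible convention.

```python
import sqlite3

# Hypothetical table: only the six audit columns come from the post.
DDL = """
CREATE TABLE payment (
    payment_id      INTEGER PRIMARY KEY,
    amount_cents    INTEGER NOT NULL,
    create_user_id  TEXT    NOT NULL,                       -- who/what created the record
    create_dt       TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    update_user_id  TEXT    DEFAULT NULL,                   -- stays NULL until first update
    update_dt       TEXT    DEFAULT NULL,
    archive_dt      TEXT    NOT NULL DEFAULT '2999-12-31',  -- "not known yet" placeholder
    unique_trans_id INTEGER NOT NULL                        -- shared by rows of one transaction
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
conn.execute(
    "INSERT INTO payment (amount_cents, create_user_id, unique_trans_id) "
    "VALUES (?, ?, ?)",
    (1999, "batch_loader", 1),
)

# A freshly created row has no update columns set yet.
row = conn.execute(
    "SELECT create_user_id, update_user_id, update_dt, archive_dt FROM payment"
).fetchone()
print(row)
```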
- ARCHIVE_DT or when you can finally delete some shit
Knowing when a record can be disposed of in the future is deeply valuable to keeping database tables clean and containing only useful data.
31-DEC-2999 is pretty common for records that don't have an easily known value.
If you are in an organization that is trying to monetize your data, treat this value as the date when the user will no longer be able to see this record.
If you are in a more sane industry, treat it as the records retention date minus the X days that match your _A table storage duration or, if the record is already in your _A table, as the date it will be deleted and no longer recoverable.
You may wish to put some thought into your database backup and rotation schedule so that those records are cleared by that date as well, but I leave that as an exercise for the reader.
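A purge job driven by ARCHIVE_DT could look something like the sketch below. The `claim` table and the `purge_expired` helper are made up for illustration, and dates are stored as ISO-8601 text so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claim (claim_id INTEGER PRIMARY KEY, archive_dt TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO claim (claim_id, archive_dt) VALUES (?, ?)",
    [
        (1, "2015-06-30"),  # retention period already over
        (2, "2999-12-31"),  # placeholder for "no known disposal date"
        (3, "2031-01-01"),  # still inside its retention period
    ],
)


def purge_expired(conn, as_of):
    """Delete rows whose archive date has passed; return how many were removed."""
    cur = conn.execute("DELETE FROM claim WHERE archive_dt <= ?", (as_of,))
    return cur.rowcount


deleted = purge_expired(conn, "2024-01-01")
remaining = [r[0] for r in conn.execute("SELECT claim_id FROM claim ORDER BY claim_id")]
print(deleted, remaining)
```

Rows carrying the 2999 placeholder are never picked up, which is the point of using a far-future default.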
- Reference table design
There are 2 ways of doing reference tables:
Unique hand-written tables that perfectly match your desired data
or
The RT_ tables pattern mixed with cached views, which gives you a useful versioned reference table with an effective begin date, meaningful descriptions, a version number and the effective end date (if it is set). You also get the ability to look up previous version values if needed, plus who created the values, when they were created, who updated them and when they were updated (and, if you follow _A table best practices, all of the previous updates too); not that you would likely need to update the values without doing a version update as well.
Insert in the following order to avoid constraint violations:
RT_TABLE
RT_FIELD_DOMAIN (only need to add entries when creating new reference table views or adding columns to reference tables)
RT_TABLE_FIELD (duplicate old RT_FIELD_DOMAIN values with new table to keep old column names)
RT_FIELD_VALUES (Easiest to do 1 row or column at a time)
Or just insert them all in a single transaction
RT_TABLE design
This is the master reference table for finding what reference tables exist and the versions that exist for them.
| Name            | Null     | Type          |
|-----------------+----------+---------------|
| REF_TABLE_ID    | NOT NULL | NUMBER        |
| TABLE_ID        |          | NUMBER        |
| VERSION         |          | NUMBER        |
| NAME            |          | VARCHAR2(30)  |
| DESCRIPTION     |          | VARCHAR2(255) |
| COMMENTS        |          | VARCHAR2(255) |
| STATUS          |          | CHAR(1)       |
| CREATE_USER_ID  | NOT NULL | VARCHAR2(20)  |
| UPDATE_USER_ID  |          | VARCHAR2(20)  |
| CREATE_DT       | NOT NULL | DATE          |
| UPDATE_DT       |          | DATE          |
| UNIQUE_TRANS_ID | NOT NULL | NUMBER        |
| EFF_BEGIN_DT    | NOT NULL | DATE          |
| EFF_END_DT      |          | DATE          |
| ARCHIVE_DT      | NOT NULL | DATE          |
REF_TABLE_ID is the primary key.
RT_FIELD_VALUES design
The actual reference table values.
| Name               | Null     | Type          |
|--------------------+----------+---------------|
| REF_TABLE_FIELD_ID | NOT NULL | NUMBER        |
| FIELD_ROW_ID       | NOT NULL | NUMBER        |
| FIELD_VALUE        |          | VARCHAR2(255) |
| CREATE_USER_ID     | NOT NULL | VARCHAR2(20)  |
| UPDATE_USER_ID     |          | VARCHAR2(20)  |
| CREATE_DT          | NOT NULL | DATE          |
| UPDATE_DT          |          | DATE          |
| UNIQUE_TRANS_ID    | NOT NULL | NUMBER        |
| ARCHIVE_DT         | NOT NULL | DATE          |
REF_TABLE_FIELD_ID is a foreign key to RT_TABLE_FIELD.REF_TABLE_FIELD_ID
FIELD_ROW_ID is a sequence value shared by all entries on a row
RT_TABLE_FIELD design
This is the glue table for all of the reference tables.
| Name               | Null     | Type         |
|--------------------+----------+--------------|
| REF_TABLE_FIELD_ID | NOT NULL | NUMBER       |
| REF_TABLE_ID       | NOT NULL | NUMBER       |
| FIELD_ID           | NOT NULL | NUMBER       |
| CREATE_USER_ID     | NOT NULL | VARCHAR2(20) |
| UPDATE_USER_ID     |          | VARCHAR2(20) |
| CREATE_DT          | NOT NULL | DATE         |
| UPDATE_DT          |          | DATE         |
| UNIQUE_TRANS_ID    | NOT NULL | NUMBER       |
| ARCHIVE_DT         | NOT NULL | DATE         |
REF_TABLE_FIELD_ID is the primary key (sequence or uuid)
REF_TABLE_ID is a foreign key to RT_TABLE.REF_TABLE_ID
FIELD_ID is a foreign key to RT_FIELD_DOMAIN.FIELD_ID
RT_FIELD_DOMAIN design
The actual column names for the reference tables.
| Name            | Null     | Type         |
|-----------------+----------+--------------|
| FIELD_ID        | NOT NULL | NUMBER       |
| NAME            |          | VARCHAR2(50) |
| DATA_TYPE       |          | CHAR(1)      |
| MAX_LENGTH      |          | NUMBER(5)    |
| NULLS_ALLOWED   |          | CHAR(1)      |
| CREATE_USER_ID  | NOT NULL | VARCHAR2(20) |
| UPDATE_USER_ID  |          | VARCHAR2(20) |
| CREATE_DT       | NOT NULL | DATE         |
| UPDATE_DT       |          | DATE         |
| UNIQUE_TRANS_ID | NOT NULL | NUMBER       |
| ARCHIVE_DT      | NOT NULL | DATE         |
FIELD_ID is the primary key (sequence or uuid)
RT_ALL_MV design
The master query behind all of the reference tables (keep it cached)
CREATE VIEW IF NOT EXISTS RT_ALL_MV AS
SELECT A.NAME AS TABLENAME
,A.VERSION AS VERSION
,D.FIELD_ID AS FIELDID
,A.EFF_BEGIN_DT AS EFFBEGDATE
,A.EFF_END_DT AS EFFENDDATE
,B.FIELD_ROW_ID AS ROW_ID
,D.NAME AS COLUMNNAME
,B.FIELD_VALUE AS COLUMNVALUE
FROM RT_TABLE A
,RT_FIELD_VALUES B
,RT_TABLE_FIELD C
,RT_FIELD_DOMAIN D
WHERE A.REF_TABLE_ID = C.REF_TABLE_ID
AND B.REF_TABLE_FIELD_ID = C.REF_TABLE_FIELD_ID
AND C.FIELD_ID = D.FIELD_ID;
Example RT_ view
Current values can be just:
SELECT * FROM RT_example_MV;
For figuring out previous values or making a view:
For SQL dialects that support DECODE
SELECT MAX(DECODE(COLUMNNAME, 'CODE', COLUMNVALUE)) AS CODE
,MAX(DECODE(COLUMNNAME, 'DESCRIPTION', COLUMNVALUE)) AS DESCRIPTION
,MAX(VERSION) AS VERSION
,MAX(EFFBEGDATE) AS EFF_BEGIN_DT
,MAX(EFFENDDATE) AS EFF_END_DT
FROM RT_ALL_MV
WHERE TABLENAME LIKE '%STATUS_IND%'
AND VERSION=3
GROUP BY ROW_ID
ORDER BY CODE;
For SQL dialects without DECODE
SELECT MAX(CASE COLUMNNAME WHEN 'Code' THEN COLUMNVALUE END) AS 'Code'
,MAX(CASE COLUMNNAME WHEN 'S0_Rate' THEN COLUMNVALUE END) AS 'S0 Rate'
,MAX(CASE COLUMNNAME WHEN 'S1_Rate' THEN COLUMNVALUE END) AS 'S1 Rate'
,MAX(VERSION) AS VERSION
,MAX(EFFBEGDATE) AS EFF_BEGIN_DT
,MAX(EFFENDDATE) AS EFF_END_DT
FROM RT_ALL_MV
WHERE TABLENAME LIKE '%example%'
AND VERSION=1
GROUP BY ROW_ID
ORDER BY CODE;
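To see the whole RT_ pattern working end to end, here is a small sketch in Python with the built-in sqlite3 module. It is not the original Oracle setup: the audit columns are left out for brevity, the view uses explicit JOINs instead of the comma joins above, and the STATUS_IND sample rows are invented. It does follow the insert order given earlier and pivots with the CASE form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE rt_table (
    ref_table_id INTEGER PRIMARY KEY, name TEXT, version INTEGER,
    eff_begin_dt TEXT NOT NULL, eff_end_dt TEXT);
CREATE TABLE rt_field_domain (field_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE rt_table_field (
    ref_table_field_id INTEGER PRIMARY KEY,
    ref_table_id INTEGER NOT NULL REFERENCES rt_table (ref_table_id),
    field_id INTEGER NOT NULL REFERENCES rt_field_domain (field_id));
CREATE TABLE rt_field_values (
    ref_table_field_id INTEGER NOT NULL
        REFERENCES rt_table_field (ref_table_field_id),
    field_row_id INTEGER NOT NULL,
    field_value TEXT);
CREATE VIEW rt_all_mv AS
SELECT a.name AS tablename, a.version AS version, d.field_id AS fieldid,
       a.eff_begin_dt AS effbegdate, a.eff_end_dt AS effenddate,
       b.field_row_id AS row_id, d.name AS columnname,
       b.field_value AS columnvalue
FROM rt_table a
JOIN rt_table_field c ON a.ref_table_id = c.ref_table_id
JOIN rt_field_values b ON b.ref_table_field_id = c.ref_table_field_id
JOIN rt_field_domain d ON c.field_id = d.field_id;
""")

# One transaction, parents before children, so no constraint is violated.
with conn:
    conn.execute(
        "INSERT INTO rt_table VALUES (1, 'STATUS_IND', 1, '2020-01-01', NULL)")
    conn.executemany("INSERT INTO rt_field_domain VALUES (?, ?)",
                     [(10, "CODE"), (11, "DESCRIPTION")])
    conn.executemany("INSERT INTO rt_table_field VALUES (?, ?, ?)",
                     [(100, 1, 10), (101, 1, 11)])
    conn.executemany("INSERT INTO rt_field_values VALUES (?, ?, ?)",
                     [(100, 1, "A"), (101, 1, "Active"),
                      (100, 2, "C"), (101, 2, "Closed")])

# Pivot the name/value rows back into one row per FIELD_ROW_ID.
PIVOT = """
SELECT MAX(CASE columnname WHEN 'CODE' THEN columnvalue END) AS code,
       MAX(CASE columnname WHEN 'DESCRIPTION' THEN columnvalue END) AS description,
       MAX(version) AS version
FROM rt_all_mv
WHERE tablename = ? AND version = ?
GROUP BY row_id
ORDER BY code
"""
rows = conn.execute(PIVOT, ("STATUS_IND", 1)).fetchall()
print(rows)
```

The MAX(CASE ...) trick works because each group has exactly one non-NULL value per column name, and MAX ignores the NULLs.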
- Rate histories or cleanly storing history
The HIST_NAV_IND column:
When you want a history of values (such as ratings) in the main table for some business requirement, add this column and use the following values:
S => When you have only 1 record
F => The first record when you have more than 1 record
P => The current primary record when you have more than 1
M => The previous P records that have been superseded.
The EFF_BEGIN_DT and EFF_END_DT columns:
In case you need to reprocess old records, you will want an easy way to figure out which rate history record to use; EFF_BEGIN_DT and EFF_END_DT make that simple.
EFF_BEGIN_DT is always set in every record (generally it should match the create date but there are business reasons why you want it separate).
EFF_END_DT should be NULL for the current primary record (unless you are organized enough to always know the future rate change date in advance [unlikely]) and should always be set for the M and F records to the day [or hour, minute or second] prior to the EFF_BEGIN_DT of the new P record. The EFF_END_DT of one record should never overlap with the EFF_BEGIN_DT of the next, and you can use TRUNC("TimeStamp", DATE) to ensure that your select driver will always get either 1 record [normally] or zero records [they shouldn't have been included].
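The lookup this enables can be sketched as follows, again with Python's sqlite3. The `rate_history` table, its sample rates and the `rate_on` helper are hypothetical, dates are at day granularity, and the end date is treated as inclusive (the day before the next EFF_BEGIN_DT).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE rate_history (
    rate_id      INTEGER PRIMARY KEY,
    rate         REAL NOT NULL,
    hist_nav_ind TEXT NOT NULL,   -- S, F, P or M as described above
    eff_begin_dt TEXT NOT NULL,
    eff_end_dt   TEXT             -- NULL on the current primary record
)""")
conn.executemany(
    "INSERT INTO rate_history VALUES (?, ?, ?, ?, ?)",
    [
        (1, 0.050, "F", "2020-01-01", "2020-12-31"),
        (2, 0.055, "M", "2021-01-01", "2022-06-30"),
        (3, 0.060, "P", "2022-07-01", None),
    ],
)


def rate_on(conn, day):
    """Return (rate, hist_nav_ind) in effect on `day`, or None if there is none."""
    rows = conn.execute(
        """SELECT rate, hist_nav_ind FROM rate_history
           WHERE eff_begin_dt <= ?
             AND (eff_end_dt IS NULL OR eff_end_dt >= ?)""",
        (day, day),
    ).fetchall()
    if len(rows) > 1:
        # Overlapping ranges mean the history was maintained incorrectly.
        raise ValueError(f"overlapping rate history for {day}")
    return rows[0] if rows else None


print(rate_on(conn, "2021-03-15"))
print(rate_on(conn, "2023-01-01"))
print(rate_on(conn, "2019-01-01"))
```

Because the ranges never overlap, the query returns one row for any covered date and nothing for dates before the first record.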
- UNIQUE_TRANS_ID or letting you track what occurred together.
You will find 2 different implementations of this. The first (very wrong) is a unique sequence for every table, which really just serves the purpose of a HISTORY_SEQ column.
The second (correct) is a global sequence whose value is the same for all records in all tables that are touched by a single transaction. The purpose is to make it trivial to find all records (inserted, updated [and deleted if using _A tables]) in a single transaction. [You'll want to add an AUDIT_UNIQUE_TRANS_ID column to your _A tables for that linkage]
In simple environments this can be just a simple sequence and in more advanced environments this can be a UUID. The key is that it must be unique to every transaction, but its value should not be used to infer anything about the order of events in a table (that is the job of a HISTORY_SEQ column).
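A minimal sketch of the "correct" variant, using a UUID per transaction in Python with sqlite3. The `account` and `ledger` tables and the `deposit` and `rows_touched` helpers are invented for the example.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (
    account_id INTEGER PRIMARY KEY,
    balance INTEGER NOT NULL,
    unique_trans_id TEXT NOT NULL);
CREATE TABLE ledger (
    ledger_id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    delta INTEGER NOT NULL,
    unique_trans_id TEXT NOT NULL);
""")
conn.execute("INSERT INTO account VALUES (1, 0, ?)", (str(uuid.uuid4()),))


def deposit(conn, account_id, amount):
    """Apply one deposit; every row it touches gets the same transaction id."""
    trans_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE account SET balance = balance + ?, unique_trans_id = ? "
            "WHERE account_id = ?",
            (amount, trans_id, account_id),
        )
        conn.execute(
            "INSERT INTO ledger (account_id, delta, unique_trans_id) "
            "VALUES (?, ?, ?)",
            (account_id, amount, trans_id),
        )
    return trans_id


def rows_touched(conn, trans_id):
    """Count rows per table that still carry the given transaction id."""
    return {
        table: conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE unique_trans_id = ?",
            (trans_id,),
        ).fetchone()[0]
        for table in ("account", "ledger")
    }


t1 = deposit(conn, 1, 500)
t2 = deposit(conn, 1, 250)
print(rows_touched(conn, t2))
print(rows_touched(conn, t1))
```

Note that the first transaction no longer shows up in `account`, because the second update overwrote the id on that row. Tracing older transactions through updated rows is what the AUDIT_UNIQUE_TRANS_ID linkage on the _A tables is for.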
- HISTORY_SEQ column or sanity checking basic mode
Use this if you might need to store multiple duplicate records or want a sequence number for the order of created/updated records in your table.
The big annoying bit is that you also need to update this column on EVERY SINGLE UPDATE to that table, and you'll want _A tables if you want to figure out the historical ordering of events. You will also be creating a unique sequence for every single table where this column exists, but just shove that functionality in a trigger.
This is also quite handy if you want a unique key handle for picking which records are being manually deleted, and it gives you the solution for when one person updates a record at the same time someone else is trying to delete it.
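Here is one way the trigger and the update-versus-delete guard might look, sketched in Python with sqlite3. SQLite has no sequence objects, so MAX(history_seq) + 1 stands in for the per-table sequence; the `item` table and the `delete_if_unchanged` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    item_id     INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    history_seq INTEGER NOT NULL DEFAULT 0);

-- Stand-in for a per-table sequence: next value is the current max plus one.
CREATE TRIGGER item_seq_ins AFTER INSERT ON item
BEGIN
    UPDATE item SET history_seq = (SELECT MAX(history_seq) + 1 FROM item)
    WHERE item_id = NEW.item_id;
END;

-- Fires only on changes to name, so bumping history_seq does not re-trigger it.
CREATE TRIGGER item_seq_upd AFTER UPDATE OF name ON item
BEGIN
    UPDATE item SET history_seq = (SELECT MAX(history_seq) + 1 FROM item)
    WHERE item_id = NEW.item_id;
END;
""")


def delete_if_unchanged(conn, item_id, seen_seq):
    """Delete only if the row still has the history_seq the caller last saw."""
    cur = conn.execute(
        "DELETE FROM item WHERE item_id = ? AND history_seq = ?",
        (item_id, seen_seq),
    )
    return cur.rowcount == 1


conn.execute("INSERT INTO item (item_id, name) VALUES (1, 'widget')")
conn.execute("INSERT INTO item (item_id, name) VALUES (2, 'gadget')")

# User A reads the row, then user B updates it before A's delete arrives.
seen = conn.execute("SELECT history_seq FROM item WHERE item_id = 1").fetchone()[0]
conn.execute("UPDATE item SET name = 'widget v2' WHERE item_id = 1")

stale_delete = delete_if_unchanged(conn, 1, seen)
current = conn.execute("SELECT history_seq FROM item WHERE item_id = 1").fetchone()[0]
fresh_delete = delete_if_unchanged(conn, 1, current)
print(seen, current, stale_delete, fresh_delete)
```

The stale delete is refused because the row's sequence moved on after the concurrent update; the caller has to re-read the row before trying again.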
- _A tables or how not to accidentally lose your shit
Sometimes called journal or audit tables. _A tables do the following magic trick: you can't screw up or delete your data in a way you can't recover from.
In the simplest version possible, you take your table foo, duplicate its structure in a table named foo_A and add 2 columns: audit_dt and audit_user_id. Then you create triggers for updates and deletes on the table foo to first write the old values as a new insert in the foo_A table.
Now even if you screw up your select and delete all of the contents of table foo, everything will still be in table foo_A. If you accidentally overwrite everything in foo with garbage data, the good data will still be in foo_A.
Neither the application nor any of the users needs to know about the _A tables (unless you want to leverage stored procedures instead of triggers to create the _A table entries).
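The simplest version described above can be sketched with SQLite triggers from Python. The columns of `foo` are invented, and since SQLite has no session user, audit_user_id is taken from the row's own update_user_id on updates and left NULL on deletes; a real database would use its own notion of the current user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foo (
    id INTEGER PRIMARY KEY,
    val TEXT NOT NULL,
    update_user_id TEXT);

-- Same structure as foo plus the two audit columns; no primary key,
-- because the same id will appear many times.
CREATE TABLE foo_A (
    id INTEGER, val TEXT, update_user_id TEXT,
    audit_dt TEXT NOT NULL, audit_user_id TEXT);

CREATE TRIGGER foo_bu BEFORE UPDATE ON foo
BEGIN
    INSERT INTO foo_A
    VALUES (OLD.id, OLD.val, OLD.update_user_id,
            CURRENT_TIMESTAMP, NEW.update_user_id);
END;

CREATE TRIGGER foo_bd BEFORE DELETE ON foo
BEGIN
    INSERT INTO foo_A
    VALUES (OLD.id, OLD.val, OLD.update_user_id, CURRENT_TIMESTAMP, NULL);
END;
""")

conn.executemany("INSERT INTO foo (id, val) VALUES (?, ?)",
                 [(1, "good-1"), (2, "good-2")])

# Two classic accidents: an UPDATE and a DELETE with no WHERE clause.
conn.execute("UPDATE foo SET val = 'garbage', update_user_id = 'oops'")
conn.execute("DELETE FROM foo")

foo_count = conn.execute("SELECT COUNT(*) FROM foo").fetchone()[0]
audit_count = conn.execute("SELECT COUNT(*) FROM foo_A").fetchone()[0]

# The earliest audit row per id holds the original good value.
recovered = conn.execute("""
    SELECT id, val FROM foo_A
    WHERE rowid IN (SELECT MIN(rowid) FROM foo_A GROUP BY id)
    ORDER BY id
""").fetchall()
print(foo_count, audit_count, recovered)
```

The recovery query orders by rowid rather than audit_dt because CURRENT_TIMESTAMP in SQLite only has one-second resolution.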
- How do I convince my data engineer to not modify data before including it in our db?
Our data engineer insists on lowercasing everything and removing some other formatting, like new lines, on free text fields.
They say it's "better for elastic search".
To me that makes no sense and loses information that can't be added back. But I couldn't really convince them otherwise. So far no real problem has come out of it, but it makes for a worse experience for the user. For example, company names that are acronyms show up as all lowercase (ibm, llc, etc.), and in free text fields we can no longer tell when the user wrote in caps or added paragraphs.
What are your thoughts on this?
Disclaimer, I'm not a data engineer. Just a PM from a data related product.
- Citus Data - Distributed Postgres
www.citusdata.com: Citus Data | Distributed Postgres. At any scale.
Citus gives you all the greatness of Postgres plus the superpowers of distributed tables. By distributing your data and queries, your application gets high performance—at any scale. The Citus database is available as open source and as a managed service with Azure Cosmos DB for PostgreSQL.
Citus is a PostgreSQL extension that transforms Postgres into a distributed database—so you can achieve high performance at any scale.
With Citus, you extend your PostgreSQL database with:
- Distributed tables are sharded across a cluster of PostgreSQL nodes to combine their CPU, memory, storage and I/O capacity.
- Reference tables are replicated to all nodes for joins and foreign keys from distributed tables and maximum read performance.
- Distributed query engine routes and parallelizes SELECT, DML, and other operations on distributed tables across the cluster.
- Columnar storage compresses data, speeds up scans, and supports fast projections, both on regular and distributed tables.
- Query from any node enables you to utilize the full capacity of your cluster for distributed queries.
- Cloud Backed SQLite
The Cloud Backed SQLite system allows databases to be stored within cloud storage accounts such that they can be read and written by storage clients without first downloading the entire database to the client.
- Introducing English as the New Programming Language for Apache Spark
www.databricks.com: Introducing English as the New Programming Language for Apache Spark | Databricks Blog
Introducing the English SDK for Apache Spark - an innovative tool using English language input to streamline your development process, enhance code efficiency, and accelerate data insights.
- What is Data Lineage?
airbyte.com: Data Lineage: The Unseen Lifeline of Data-Driven Organizations | Airbyte
Uncover the vital role of data lineage in driving data-driven organizations forward. Delve into its significance with expert insights.
I get questions like this a lot:
- Where did this data come from?
- How do I know I can trust the source?
- What types of QA checks were applied to this data?
Data lineage is such a chronic issue in data engineering. This blog post from Airbyte gives a good overview & mentions some interesting products/projects that might help with data lineage.
Unfortunately, I have limited flexibility to purchase or install tools for this in my current role. Has anyone rolled their own solution for this?
- Design Thinking Bootleg (Stanford)
dschool.stanford.edu: Design Thinking Bootleg — Stanford d.school
The Design Thinking Bootleg is a set of tools and methods that we keep in our back pockets, and now you can do the same.
(pdf download at the bottom of the linked page)
Often the most challenging part of data engineering is figuring out what problem to solve in the first place.
The resources Stanford put in this Design Thinking Bootleg might have something that can help you work with others and build towards a well-designed solution.
- Array programming with NumPy
www.nature.com: Array programming with NumPy - Nature
NumPy is the primary array programming library for Python; here its fundamental concepts are reviewed and its evolution into a flexible interoperability layer between increasingly specialized computational libraries is discussed.
For Python developers, the NumPy array is a widely used data structure in data science / data engineering.
Thought this paper was a good resource to learn a bit more about the history and core concepts of NumPy.
- The Missing Semester of Your CS Education
I found this content super helpful and I frequently share this with new coworkers getting started in data / dev.