CD-Stream: A CDC Replicator Tool & the Cons of ETL Pipelines


Just another day at the workplace.

Five minutes after boot-up:

You hear everyone complaining that the production database is slow. You quickly start to investigate, chasing every possible lead on the dashboards.

Could it be that long-running slow query you had already raised a ticket for production support to fix? Or is it one of the queries hitting an un-indexed column?



The 6th minute, and another 15 minutes down the line:

Next, you hear your fellow data analysts lamenting their failed reports.

You now realize that your CPU has taken on a humongous query load, and your relational database system has gone for a toss into an eternal slumber.

And all of this because of a slow-running query in your ETL pipeline..!! Ding... Ding... Ding...!! We have a winner!!!

Alright, let's phrase it this way. 

Chances are, you did or used one of the following:

- SELECT * FROM production_database.table WHERE updated_at BETWEEN x AND y;
- Airflow pipelines
- Bulk exports and dumps every few minutes
- Long-running and forgotten zombie crontabs


Let's put it this way: if your DB is smaller than 500 GB, it's OK to do bulk selects; but once it grows beyond 500 GB, unless budget cuts leave you with no other option, do not ever do a bulk SELECT and ship the result over the wire to the destination database in the form of a pipeline.

Enter CDC:

CDC, a.k.a. Change Data Capture, is there to assist with this data-wrangling exercise. If your data is growing rapidly and your BI/BA folks need access to that sweet, sweet DWH (data warehouse), THIS, right here, is the way to go.

You mean replication? - I don't wanna get stuck with my OLTP DB engine:

Well, you are not alone. Data wranglers generally prefer not to use the same database engine as the OLTP system, mainly for query performance, the need for triggers, and the ease of re-running transformation jobs. But doing so requires setting up Airflow clusters and connectors to transform and load the data.


Wouldn't it be great if you could do all this without doing a bulk SELECT from the production database? Of course, you can.

CD-Stream is a cross-database, CDC-driven replication tool that currently supports replication between MySQL and Postgres.

The tool runs queues to process the events recorded in the binary logs of the source database and replicates them across to a destination database running an entirely different engine.
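
Since the changes are read off the binary logs, the source MySQL instance typically needs row-based binary logging switched on. The fragment below is only a generic, illustrative my.cnf sketch of the settings binlog-based CDC tools usually depend on; defer to the CD-Stream project page for its exact prerequisites.

# my.cnf (illustrative) - settings commonly required for binlog-based CDC
[mysqld]
server_id        = 1          # a unique server id is mandatory once binary logging is on
log_bin          = mysql-bin  # enable the binary log
binlog_format    = ROW        # row-based events carry the actual changed data
binlog_row_image = FULL       # include all columns in every row event
expire_logs_days = 4          # keep enough history for the replicator to catch up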

https://github.com/datawrangl3r/cd-stream 
Once you're through with the setup described on the project page (CD-Stream), you'll find a directory called 'sample' in the project, which contains some intensive DDL and data-insertion scripts for you to evaluate and exercise.

And that, people, is how you wrangle production data!!!

The No-BS guide to AutoComplete and FuzzySearch in Elasticsearch


Before we begin, here are a few basics.

Analyzer:

An analyzer analyzes - that is, splits - the indexed phrase/word into tokens/terms, against which the search can then be performed with ease.

An analyzer is made up of tokenizers and filters.

Elasticsearch ships with numerous analyzers by default; here, we use custom analyzers tweaked to meet our requirements.

Filter:

A filter removes/filters out keywords from the query. This is useful when we need to remove false positives from the search results, based on the input.

We will be using a stop word filter to remove the specified keywords in the search configuration from the query text.

Tokenizer:

The input string needs to be split before it can be searched against the indexed documents. We use ngram here, which splits the query text into terms of configurable sizes.
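
To get a feel for what ngram does, you can ask Elasticsearch to analyze a sample string directly. The call below (which assumes a recent Elasticsearch version that accepts an inline tokenizer definition in the _analyze API) splits "ready" into 3-character grams:

curl -X POST http://localhost:9200/_analyze \
--header "Content-Type: application/json" \
-d '{
  "tokenizer": { "type": "ngram", "min_gram": 3, "max_gram": 3 },
  "text": "ready"
}'

This returns the terms "rea", "ead", and "ady", which is exactly what gets matched against at search time.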

Mappings:

The created analyzer needs to be mapped to a field name, for it to be efficiently used while querying.

'Tis time!!!

Now that we have covered the basics, it's time to create our index.

Fuzzy Search:

The first upon our index list is fuzzy search:

Index Creation:

curl -vX PUT http://localhost:9200/books -d @fuzzy_index.json \
--header "Content-Type: application/json"
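
The fuzzy_index.json payload is where the analyzer, filter, tokenizer, and mappings described above come together. The repository's copy is the authority; the sketch below is only an illustrative approximation (all of the names - my_stop, my_ngram, fuzzy_analyzer, and the name/author fields - are assumptions, and the mappings use the typeless Elasticsearch 7.x format):

{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": { "type": "stop", "stopwords": ["the", "a", "an"] }
      },
      "tokenizer": {
        "my_ngram": { "type": "ngram", "min_gram": 3, "max_gram": 3, "token_chars": ["letter", "digit"] }
      },
      "analyzer": {
        "fuzzy_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram",
          "filter": ["lowercase", "my_stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name":   { "type": "text", "analyzer": "fuzzy_analyzer" },
      "author": { "type": "text", "analyzer": "fuzzy_analyzer" }
    }
  }
}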
 

The following books and their corresponding authors are then loaded into the index:

name                        author
To Kill a Mockingbird       Harper Lee
When You're Ready           J.L. Berg
The Book Thief              Markus Zusak
The Underground Railroad    Colson Whitehead
Pride and Prejudice         Jane Austen
Ready Player One            Ernest Cline
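
One way to load them is a single _bulk call; the snippet below is a minimal illustration, assuming the index and field names shown above:

curl -X POST http://localhost:9200/books/_bulk \
--header "Content-Type: application/json" \
--data-binary @books.ndjson

where books.ndjson is newline-delimited JSON (the file must end with a newline):

{ "index": {} }
{ "name": "To Kill a Mockingbird", "author": "Harper Lee" }
{ "index": {} }
{ "name": "When You're Ready", "author": "J.L. Berg" }
{ "index": {} }
{ "name": "The Book Thief", "author": "Markus Zusak" }
{ "index": {} }
{ "name": "The Underground Railroad", "author": "Colson Whitehead" }
{ "index": {} }
{ "name": "Pride and Prejudice", "author": "Jane Austen" }
{ "index": {} }
{ "name": "Ready Player One", "author": "Ernest Cline" }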

Now, let's fire a fuzzy query at the index. A match query for the keyword "ready" returns the books that carry "ready" (or something close to it) anywhere in the phrase.
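
The shape of such a query (the name field and the AUTO fuzziness below are illustrative choices, not necessarily the original payload) is roughly:

curl -X POST http://localhost:9200/books/_search \
--header "Content-Type: application/json" \
-d '{
  "query": {
    "match": {
      "name": {
        "query": "ready",
        "fuzziness": "AUTO"
      }
    }
  }
}'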

AutoComplete:

Next up is autocomplete. The only difference between the fuzzy-search index and the autocomplete index is the min_gram and max_gram values.

In this case, the min_gram and max_gram values are set depending on the number of characters to be auto-filled.
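
The original snippet isn't reproduced here, but as an illustration, an autocomplete-flavoured index usually widens the gram range so that short prefixes typed by the user still match; something along these lines (the values are assumptions to tune, and on Elasticsearch 7+ you would also need to raise index.max_ngram_diff to allow the wider spread):

"tokenizer": {
  "my_ngram": {
    "type": "ngram",
    "min_gram": 1,
    "max_gram": 20,
    "token_chars": ["letter", "digit"]
  }
}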


Elasticsearch to MongoDB Migration - MongoES

The following are some of the things that developers simply love to hate!
  • The one-last-thing syndrome - This reminds me of the following quote:
  The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.
  - Tom Cargill, Bell Labs, as quoted in the book `Programming Pearls`
  • QA declaring certain undocumented features to be bugs - Seriously, this is traumatic for a developer.
  • Interruptions during coding - Here's an idea: try talking to developers while they code; chances are, you have less than 10% of their attention.
Some of these problems we get used to...

But there are others that make us wanna do this...



  • DISCONNECTION FROM THE SERVER DUE TO BAD INTERNET DURING A MIGRATION - Ouch!! That's gotta hurt real bad.

Talking about ES to MongoDB migration - how hard could that be?

Good Side:
  • JSON objects are common to both.
  • There are numerous migration tools to choose from.
Bad Side:
  • The migration can be hideous and can eat up a lot of system resources. Be ready for a system freeze, in case the migration tool uses a queue.
Ugly Side:
  • It can never be resumed from the point of failure. If connectivity goes down during the migration, the partially transferred collection has to be deleted and the transfer started all over again from the beginning.


Alright, there's nothing to feel bad about.

Enter, MongoES.



MongoES is a pure Python 3 migration tool that moves documents from an Elasticsearch index into MongoDB collections.

It's robust in its own native way: no queues or message brokers are involved, which means there won't be any memory spikes or system freezes.

This is achievable because MongoES uses a tagging strategy prior to the migration. The tagging happens in the source Elasticsearch index and serves as a checkpoint during the migration.

Why a new custom id tag, when there's an _id already?

Unless the documents are explicitly tagged, the _id fields in Elasticsearch documents are just auto-generated alphanumeric strings used to serialize the documents. These _id fields are unusable for the purpose, since queries/aggregations cannot be run against them.

MongoES - How to:
  1. Install all the Prerequisites.
  2. Clone the repository from https://github.com/datawrangl3r/mongoes.git
  3. Edit the mongoes.json file according to your requirements.

  4. Make sure that both the Elasticsearch and MongoDB services are up and running, and fire up the migration by keying in the run command (an illustrative sketch follows this list).

  5. Sit back and relax, for we've got you covered! By default, the migration transfers 1000 documents per batch.
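
For reference, step 3's mongoes.json holds the source Elasticsearch and target MongoDB connection details, and step 4 boils down to running the main script. Both the keys and the command below are illustrative guesses rather than the project's documented interface, so treat the repository's README and sample config as the source of truth:

# purely illustrative - check the repository for the actual keys and entry point
{
  "source":      { "host": "localhost", "port": 9200,  "index": "books" },
  "destination": { "host": "localhost", "port": 27017, "db": "library", "collection": "books" }
}

# then kick off the migration, for example:
python3 mongoes.py
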
Happy Wrangling!!! :)
