Today, the world is trying to create and educate data scientists because of the phenomenon of Big Data, and everyone is looking deeply into this technology. But no one is looking at the larger architectural picture of how Big Data needs to fit within existing systems (data warehousing systems). Taking a look at the larger picture into which Big Data fits gives the data scientist the necessary context for how the pieces of the puzzle should fit together. Most references on Big Data look at only one tiny part of a much larger whole. Until the data gathered can be put into an existing framework or architecture, it can’t be used to its full potential. Data Architecture: A Primer for the Data Scientist addresses the larger architectural picture of how Big Data fits with the existing information infrastructure, an essential topic for the data scientist.
Drawing upon years of practical experience and using numerous examples and an easy-to-understand framework, W.H. Inmon and Daniel Linstedt define the importance of data architecture and show how it can be used effectively to harness Big Data within existing systems. You’ll be able to:
- Turn textual information into a form that can be analyzed by standard tools
- Make the connection between analytics and Big Data
- Understand how Big Data fits within an existing systems environment
- Conduct analytics on repetitive and non-repetitive data
- Discusses the often-overlooked value in Big Data, non-repetitive data, and why there is significant business value in using it
- Shows how to turn textual information into a form that can be analyzed by standard tools
- Explains how Big Data fits within an existing systems environment
- Presents new opportunities that are afforded by the advent of Big Data
- Demystifies the murky waters of repetitive and non-repetitive data in Big Data
About the Author
Dan has more than 25 years of experience in the Data Warehousing and Business Intelligence field and is internationally known for inventing the Data Vault 1.0 model and the Data Vault 2.0 System of Business Intelligence. He helps business and government organizations around the world achieve BI excellence by applying his proven knowledge of Big Data, unstructured information management, agile methodologies and product development. He has held training classes and presented at TDWI, Teradata Partners, DAMA, Informatica and Oracle user groups, and Data Modeling Zone conferences. He has a background in SEI/CMMI Level 5 and has contributed to architecture efforts for petabyte-scale data warehouses. He also offers high-quality online training and consulting services for Data Vault.
Most helpful customer reviews
0 of 0 people found the following review helpful.
An excellent book.
Shows perspectives on data architecture for different ways of working with data. Covers the differences between the original definitions of the terms repetitive and non-repetitive data. Best practices for Data Vault!
0 of 0 people found the following review helpful.
Talk about repetitive data :)
By Rodney A. Stainback
This book was great, but it repeated itself a lot. Overall I would recommend it, though; great job on combining data lakes and Data Vault.
34 of 37 people found the following review helpful.
Less than a primer, not for data scientists, not really general data architecture, maybe the most mature Data Vault cookbook
By Dominic Roy
Putting 'primer' in the title should warn you not to expect too much. Bill Inmon used to deliver more than that.
The problem with a primer is that the authors don't have to justify, exemplify or detail anything. Things are presented as they are, and you are given no room to make a choice. It's not even take it or leave it, it's only take it. I mean, most of the things look correct if you apply them and you happen to have a situation where they fit. If your situation doesn't fit, you have no escape. A primer should present only clear, simple concepts that are recognized throughout the community, and ALL the concepts pertinent to the title. Imagine a data warehouse book where slowly changing dimensions are not mentioned, nor bitemporality, CWM or metamodels. OLAP is only mentioned in the glossary. Imagine a data architecture book where the words Cartesian, constraints, enumeration or domain are not used. Even conceptual model is not used in its standard meaning. Those are cues that not all the territory is covered.
I would not recommend this book for a university student, a data professional or a data scientist. Just look at the glossary to convince yourself. A data model is defined as "an abstraction of data". DW 2.0 is defined as "the second-generation data warehouse architecture". MapReduce is defined as "a language for processing Big Data". A relational model is defined as "a form of data where data is normalized". Even Wikipedia can do better than that. Why put terms in a glossary if they are imprecisely defined and the definitions do not help relate them to the subject of the book? It leaves a bad taste for the rest of the book (the semantics may be loose and imprecise, with many shortcuts and confusions).
This book tries to cover a lot of technologies in very few pages. A very large part is dedicated to Data Vault and it is, as usual, somewhat self-promoting. However, it could be the best book on Data Vault as far as I know.
I recommend that you skip right over the topics you already know and those that aren't the main subject, because the book presents a limited understanding of those topics: data governance, SDLC, CMMI, TQM, methodologies, Sarbanes-Oxley Act, Agile, Analytics, etc. They seem to be there in an attempt to cover all the topics, but it's not convincing. In my opinion, those topics don't have their place in a technical primer.
There is no bibliography at all, which is not very good for a primer that is supposed to introduce you to a topic and guide you to more detailed information if you need it. It's disturbing that this topic doesn't have any scientific paper or any serious monograph to refer to. Hey, those guys are geniuses, they don't need it; I'm sorry, even Albert Einstein made mistakes that were corrected by means of reviewed scientific publications.
In summary, it is not a primer; it is closer to a Data Vault cookbook in a data warehouse environment, with an extension on unstructured data that is not bad. Really, it looks like the most mature book on Data Vault, but you'll have to clean things up, make your own experiments and check the coherence before applying it in a major project. Buy the book, discard some sections, put in your own bookmarks, strike through the parts that are unproven or wrong, rephrase and fill the book with your experimentation notes.