Thursday, September 30, 2010

How to add a constraint on a table?

After creating the student_target table, if we want
to make studentno the primary key, we add a constraint
on it.

ALTER TABLE STUDENT_TARGET ADD CONSTRAINT STUDENT_PK PRIMARY KEY (STUDENTNO)

Here STUDENT_PK is the constraint name.

Why is Informatica not able to load a table that has an index?


Assume your target is Oracle, and your table has a primary key and an index.

When we try to insert data from Informatica into the table, Informatica gets
an Oracle error saying that an index exists on the table and it is not able to load.

If you then try to drop the index, Oracle does not allow you to drop
an index created on a primary key.

To get around this problem
Way 1:
==================================
In this case, we need to disable the Informatica bulk load option.

We can see this option in the task properties window (double-click the task) -> Mapping
tab, then select the target table from the left panel; we will see the target load
type.

Way 2:
=====================================
Disable the parallel load option in the Oracle
connection. We can see this option when
creating the connection in Workflow Manager.

Wednesday, September 29, 2010

What is a code page in Informatica?

A code page is a character encoding scheme used by the services and tools in Informatica.



Repository Service codepage:
The Repository Service uses the character set encoded in the repository code page when writing data to the repository. You cannot change the code page in the Repository Service properties after you create the Repository Service.

Metadata Manager repository code page:
The Metadata Manager Service and Metadata Manager application use the character set encoded in the repository code page when writing data to the Metadata Manager repository.

ex: Codepage: MS Windows Latin 1 (ANSI), superset of Latin1

What is the default port for the Informatica Admin Console?

By default it uses port 6001. We can configure this port either during installation or
by modifying the server.xml file under the Informatica home.

Ex:
C:\Informatica\PowerCenter8.6.1\server\tomcat\conf


The default URL to access the Admin Console is

http://hostname:6001/adminconsole/

ex: http://avatar:6001/adminconsole/

Friday, September 24, 2010

Explain SMP and MPP systems?

In terms of software, there are two types of database systems: symmetric multiprocessing (SMP) and massively parallel processing (MPP).

An SMP database system is a database system that runs on one or more machines with several identical processors sharing the same disk storage. When an SMP database system runs on more than one machine, it is called a clustered configuration. The database is physically located in a single disk storage system.

Examples of SMP database systems are SQL Server, Oracle, DB/2, Informix, and Sybase.

An MPP database system is a database system that runs on more than one machine where each machine has its own disk storage. The database is physically located in several disk storage systems that are interconnected to each other. An MPP database system is also known as a parallel database system.
Examples of MPP database systems are Teradata, Neoview, Netezza, and DATAllegro.



The machines in SMP and MPP database systems are called nodes.

An MPP database system is faster and more scalable than an SMP database system. In an MPP database system, a table is physically located in several nodes, each with its own storage.

What is a real-time data warehouse?

A real-time data warehouse is a data warehouse that is updated (by the ETL) the
moment the transaction happens in the source system.
For example, you can put triggers on the sales transaction table in the source system so that whenever a transaction is inserted into the database, the trigger fires and sends the new record to the data warehouse as a message. The data warehouse has an active listener that captures the message the moment it arrives, cleanses it, applies data quality rules to it, transforms it, and inserts it into the fact table immediately. We are talking about a two-second time difference here, between the moment a customer purchases a product on the web site and the moment the data is available in the fact table.
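The trigger part of this flow can be sketched with Python's built-in sqlite3 module. The table and trigger names here are made up for illustration, and a real implementation would send a message to the ETL listener rather than insert into the fact table directly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical source transaction table and target fact table.
cur.execute("CREATE TABLE sales_txn (txn_id INTEGER PRIMARY KEY, product TEXT, amount REAL)")
cur.execute("CREATE TABLE fact_sales (txn_id INTEGER, product TEXT, amount REAL)")

# The trigger fires on every insert and pushes the new record onward
# immediately (a real system would enqueue a message for the ETL
# listener instead of writing straight into the warehouse).
cur.execute("""
CREATE TRIGGER push_to_dw AFTER INSERT ON sales_txn
BEGIN
    INSERT INTO fact_sales (txn_id, product, amount)
    VALUES (NEW.txn_id, NEW.product, NEW.amount);
END
""")

cur.execute("INSERT INTO sales_txn (product, amount) VALUES ('Widget', 9.99)")
conn.commit()

# The record is visible in the fact table the moment it is inserted.
print(cur.execute("SELECT product, amount FROM fact_sales").fetchall())
```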

What is an ODS?

An ODS is a relational, normalized data store containing the transaction data and current values of master data from the OLTP system.

An ODS does not store the history of master data such as the customer, store, and product. When the value of the master data in the OLTP system changes, the ODS is
updated accordingly. An ODS integrates data from several OLTP systems. Unlike a data warehouse, an ODS is updatable.

Because an ODS contains integrated data from several OLTP systems, it is an ideal place
to use for customer support.

Who are the fathers of data warehousing?

Bill Inmon and Ralph Kimball, the fathers of data warehousing:

• According to Bill Inmon, a data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions.

• According to Ralph Kimball, a data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and
implements querying and analysis for the purpose of decision making.


Both of them agree that a data warehouse integrates data from various operational source systems. In Inmon’s approach, the data warehouse is physically implemented as a normalized data store. In Kimball’s approach, the data warehouse is physically implemented as a dimensional data store.

What is business intelligence?

Business intelligence is a collection of activities to understand business situations by performing various types of analysis on the company data, as well as on external data from third parties, to help make strategic, tactical, and operational business decisions and take the necessary actions to improve business performance. This includes gathering, analyzing, understanding, and managing data about operational performance, customer and supplier activities, financial performance, market movements, competition, regulatory compliance, and quality controls.

What is normalization?

Normalization is the process of removing data redundancy by applying normalization
rules. There are five normal forms, from the first normal form to the fifth normal form.
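A minimal illustration of removing redundancy, with made-up customer and order data: the denormalized rows repeat the customer's city on every order, and normalizing moves it into its own customer table keyed by customer ID:

```python
# A denormalized order list repeats customer details on every row.
orders = [
    {"order_id": 1, "cust_id": "C1", "cust_city": "Delhi", "amount": 100},
    {"order_id": 2, "cust_id": "C1", "cust_city": "Delhi", "amount": 250},
    {"order_id": 3, "cust_id": "C2", "cust_city": "Pune",  "amount": 75},
]

# Normalizing splits the customer attributes into their own table,
# keyed by cust_id, so each city is stored only once per customer.
customers = {}
normalized_orders = []
for row in orders:
    customers[row["cust_id"]] = {"cust_city": row["cust_city"]}
    normalized_orders.append(
        {"order_id": row["order_id"], "cust_id": row["cust_id"], "amount": row["amount"]}
    )

print(customers)          # city stored once per customer
print(normalized_orders)  # orders reference customers by key only
```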

List the schemas used in the implementation of a dimensional data store?

A dimensional data store (DDS) can be physically implemented in several different
schemas.

Some examples of dimensional data store schemas are:

a star schema,
a snowflake schema,
and a galaxy schema.


In a star schema, a dimension does not have a subtable (a subdimension).

In a snowflake schema, a dimension can have a subdimension. The purpose of having a subdimension is to minimize redundant data.

A galaxy schema is also known as a fact constellation schema. In a galaxy schema, you have two or more related fact tables surrounded by common dimensions
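The star/snowflake difference can be sketched with Python's sqlite3 (the table names are hypothetical): in the star schema the category attribute sits inside the product dimension itself, while in the snowflake schema it is normalized out into a subdimension:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: the product dimension holds all attributes in one table.
cur.execute("""
CREATE TABLE dim_product_star (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_name TEXT    -- category value repeated on every product row
)""")

# Snowflake schema: the category is normalized into a subdimension.
cur.execute("""
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT
)""")
cur.execute("""
CREATE TABLE dim_product_snow (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
)""")

# The fact table references the product dimension by key in both designs.
cur.execute("""
CREATE TABLE fact_sales (
    product_key INTEGER,
    sale_amount REAL
)""")
```

Queries against the snowflake design need one extra join (product to category), which is the query-performance trade-off described below.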

What is a dimensional data store (DDS)?

A DDS is a database that stores the data warehouse data in a different format than OLTP.

The reason for getting the data from the source system into the DDS and then querying the DDS instead of querying the source system directly is that in a DDS the data is arranged in a dimensional format that is more suitable for analysis.

The second reason is that a DDS contains integrated data from several source systems.

The third reason is that we cannot take the risk of querying the source systems directly; the extra query load could bring the source systems down.

What is a data profiler?

A data profiler is a tool that has the capability
to analyze data, such as finding out how many rows are in each table, how many rows
contain NULL values, and so on.

What is an OLTP system?

Online Transaction Processing (OLTP) is a system whose main purpose is to capture
and store business transactions.

What is a multidimensional database?

A multidimensional database is a form of database where the data is stored in cells and
the position of each cell is defined by a number of hierarchies called dimensions. Each cell represents a business event, and the values of the dimensions show when and where this event happened.

Examples of multidimensional databases are Microsoft Analysis Services, Hyperion
Essbase, and Cognos PowerCube. The other terms that people use for a multidimensional
database are hypercube, cube, OLAP cube, multidimensional database (abbreviated as MDB,
MDD, or MDDB), and multidimensional data store (MDDS).

Example of sequential file storage?

Sequential file storage, such as ISAM.

Example of hierarchical databases?

Hierarchical databases, such as Adabas and IMS.

What is an MPP system?

Massively parallel processing (MPP):

In MPP database systems, each node has its own memory, processor, and
disk. This is known as a shared-nothing architecture. An MPP database system is more powerful than systems with shared disks because data loading happens in parallel across multiple nodes, each with its own disks.

The main advantage of MPP database systems is that the performance increase is linear.

ex:
In MPP database systems, if you put in ten nodes, the performance is
almost ten times the performance of a single node.

Example MPP databases?

Examples of MPP database systems are Teradata, Netezza, and Neoview

what is Extract, Load, and Transform (ELT)?

Extract, Load, and Transform (ELT):

In the ELT approach, we pull the data from the source systems, load it into the data warehouse, and then apply the transformations by updating the data in the warehouse. In essence, we copy the source system (OLTP) data into the data warehouse and transform it there. People usually take the ETL approach if they have a strong ETL server and strong software that is rich with all kinds of transformation and data quality processes.

People usually take the ELT approach if they have a strong data warehouse database system, usually MPP database systems. Massively parallel processing (MPP) is a group of servers (called nodes), and each node has its own memory, processor, and disk
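A minimal ELT sketch using Python's sqlite3 (the staging table and column names are invented for illustration): the data is loaded as-is, and the transformation happens afterwards inside the warehouse database with set-based SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Extract + Load: copy the source rows into the warehouse as-is,
# dirty values and all.
cur.execute("CREATE TABLE stg_customer (name TEXT, country TEXT)")
cur.executemany("INSERT INTO stg_customer VALUES (?, ?)",
                [(" alice ", "uk"), ("BOB", "us")])

# Transform: done in place, inside the warehouse database, with
# set-based SQL so the (ideally MPP) engine does the heavy lifting.
cur.execute("""
UPDATE stg_customer
SET name = UPPER(TRIM(name)),
    country = UPPER(country)
""")
conn.commit()

print(cur.execute("SELECT name, country FROM stg_customer ORDER BY name").fetchall())
```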

What are leakage and recoverability?

Leakage happens when the ETL process thinks it has downloaded all the data completely from the source system but in reality has missed some records. A good ETL process should not have any leakage.

Recoverability means that the ETL process should be robust so that in the event of a failure, it can recover without data loss or damage.

What is ETL?

ETL stands for Extract, Transform, and Load. It is the process of retrieving and transforming data from the source system and putting it into the data warehouse.

What is partitioning?

There are two types of partitioning: vertical partitioning and horizontal partitioning.
Vertical partitioning is splitting a table vertically into several smaller tables, with each table containing some columns of the original table. Horizontal partitioning is splitting a table horizontally into several smaller tables, with each table containing some rows of the original table.
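Both kinds of partitioning can be illustrated with a small made-up table:

```python
# Original table rows with four columns.
rows = [
    {"id": 1, "name": "A", "city": "Delhi", "year": 2009},
    {"id": 2, "name": "B", "city": "Pune",  "year": 2010},
    {"id": 3, "name": "C", "city": "Delhi", "year": 2010},
]

# Vertical partitioning: split by columns; each part keeps the key
# so the original rows can be reassembled by joining on id.
part_names  = [{"id": r["id"], "name": r["name"]} for r in rows]
part_cities = [{"id": r["id"], "city": r["city"], "year": r["year"]} for r in rows]

# Horizontal partitioning: split by rows, here on the year column.
part_2009 = [r for r in rows if r["year"] == 2009]
part_2010 = [r for r in rows if r["year"] == 2010]

print(len(part_2009), len(part_2010))
```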

List some techniques that improve data warehouse performance?

Summary Tables

Table partitioning

Indexes

What are the types of fact tables?

We have three types of fact tables:
the transactional fact table,
the periodic snapshot fact table, and
the accumulating snapshot fact table.

What is RAID?

Redundant Array of Inexpensive Disks (RAID) is a method to configure several disks as one unit to make them more resilient by creating some physical data redundancy.

The popular RAID configurations (some people say RAID levels) are RAID 0, RAID 1, RAID 5, and RAID 1+0 (some people say RAID 10).

RAID 0 (also known as a stripe set or striped volume) splits data evenly across two or more disks (striping) with no parity information for redundancy.

RAID 1 creates an exact copy (or mirror) of a set of data on two or more disks.

RAID 5 uses block-level striping with parity data distributed across all member disks. RAID 5 has achieved popularity because of its low cost of redundancy.

What is a failover cluster in SQL Server?

A failover cluster is a configuration of SQL Server installations on
several identical servers (nodes). The SQL Server database instances run on an active
node, but when the active node is not available, the database instances automatically switch
over to a secondary node.

What is a transaction table and a master table?

A transaction table is a table that contains a business transaction or business event.

A master table is a table that contains the persons or objects involved in the business event.

What is a transaction fact table?

A fact table that contains a collection of business events is called a transaction fact table.

What is a periodic snapshot fact table?

A periodic snapshot fact table captures the state or condition of the business at a particular point in time.
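A small sketch of the difference, with made-up data: the transaction fact table holds one row per sale, and a daily periodic snapshot can be derived from it by summarizing the state per period:

```python
from collections import defaultdict

# Transaction fact table: one row per business event (each sale).
txn_fact = [
    {"date": "2010-09-01", "product": "P1", "qty": 2},
    {"date": "2010-09-01", "product": "P1", "qty": 1},
    {"date": "2010-09-02", "product": "P1", "qty": 4},
]

# Periodic snapshot fact table: one row per product per day capturing
# the state (here, total quantity sold) at the end of that period.
snapshot = defaultdict(int)
for row in txn_fact:
    snapshot[(row["date"], row["product"])] += row["qty"]

print(dict(snapshot))
```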

What are the data warehouse dimensional models?

star schema model
snowflake schema model

For performance reasons, as well as design simplicity and consistency, it is better to stick to a star schema rather than a snowflake schema, at least as much as we can.

A star schema is simpler and is more consistent than a snowflake schema because it has only one level in all dimensions. Because a star schema is simpler, it is easier for the ETL processes to load the data into it.

A snowflake schema is when you normalize a dimension into several smaller tables with a hierarchy structure. It is used when you want to benefit from less data redundancy. One benefit of a snowflake schema is that some analytics applications work better with a snowflake schema than with a star schema.

The other benefit of a snowflake schema is that less disk space is required. A snowflake schema can decrease query performance, but it can also increase query performance. It decreases the fact table query performance because we have to join more tables to the fact table. It increases the query performance when we want to get the distinct value of a particular dimension attribute. If this attribute is normalized
into its own subdimension table, the select distinct query runs more quickly.

How about the rapidly changing dimension (RCD)?

If, on average, the rows in the dimension tables change once a month or more frequently, how do we store the historical value of dimensional attributes? The answer is that we store them in the fact table. Say we have a customer dimension with ten attributes. Attributes 1 to 9 change once a year. Attribute 10 changes every day. We should remove the 10th column from the customer dimension and put it as a degenerate dimension column in the fact table


A degenerate dimension doesn’t have a dimension key


Explain Slowly Changing Dimension (SCD) types?

SCD is a technique used to store the historical values of dimension attributes.

The values of dimensional attributes change as time goes by. When these attribute values change, you can overwrite the old values with the new ones, or you can
preserve the old value. There are two methods of preserving the old attribute values: you can store the old values as rows, or you can store them as columns

There are three types of SCD:
• SCD type 1 overwrites the old values of the attributes, so the old values are not kept.
Ex: In the product dimension we will use SCD type 1, meaning we will overwrite the old attribute values.

• SCD type 2 keeps the old values by creating a new row for each change.
Ex: customer dimension

• SCD type 3 keeps the old values by putting them in another column.
Ex: the store dimension is an ideal case for SCD type 3

SCD type 2 is more flexible for storing the historical values of dimensional
attributes, because you can store as many old versions as you want without altering the table structure. SCD type 3 uses columns to store old values, which is why it is not flexible. It is ideal for situations where you don’t have many old versions (five or fewer) and you know there will be only a certain number of versions. Type 3 is also ideal when the change in this attribute
affects a large number of rows; in other words, a lot of dimension rows change the value of this attribute at the same time (simultaneously).
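An SCD type 2 update can be sketched as follows (a hypothetical customer dimension; a real warehouse would assign surrogate keys and run this inside the ETL):

```python
from datetime import date

# Customer dimension with SCD type 2: each change adds a new row,
# expiring the previous one via effective/expiry dates and a flag.
dim_customer = [
    {"cust_key": 1, "cust_id": "C1", "city": "Delhi",
     "effective": date(2009, 1, 1), "expiry": None, "current": True},
]

def scd2_update(dim, cust_id, new_city, change_date):
    """Apply an SCD type 2 change: expire the current row, add a new one."""
    for row in dim:
        if row["cust_id"] == cust_id and row["current"]:
            row["expiry"] = change_date
            row["current"] = False
    dim.append({"cust_key": len(dim) + 1, "cust_id": cust_id, "city": new_city,
                "effective": change_date, "expiry": None, "current": True})

scd2_update(dim_customer, "C1", "Pune", date(2010, 9, 30))

# Both the old and the new value are preserved as separate rows.
print([(r["city"], r["current"]) for r in dim_customer])
```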

Why don’t we have a time dimension instead of a date dimension?

Well, it is not practical.
A date dimension with 10 years of data has 3,650 rows. If it were a time dimension with precision to the second, it would have 315 million rows and would impact the data warehouse performance because it’s used everywhere
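The arithmetic behind those row counts:

```python
days = 10 * 365                # a date dimension covering 10 years
seconds = days * 24 * 60 * 60  # one row per second instead of per day

print(days)     # 3,650 rows for a date dimension
print(seconds)  # roughly 315 million rows for a time dimension
```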

What is a Slowly Changing Dimension (SCD)?

SCD is a data modeling technique to store historical values of
dimensional attributes.

What is a dimension table?

A dimension table is a table that contains various attributes explaining the dimension key in the fact table.

What is a data mart?

A data mart is a group of related fact tables and their corresponding dimension
tables containing the measurements of business events, categorized by their dimensions. Data marts exist in dimensional data stores


A data mart (the fact table and its dimensions) stores a collection of business events in a certain business area. To design the fact tables and dimensions,
we need to have business knowledge in that business area

Ex: Product sales
subscriber profitability
supplier performance

Do you identify primary keys in data modeling?

Determining primary keys and clustered indexes is part of the physical database design process.

What is a degenerate dimension?

A degenerate dimension is a dimension with only one attribute,
and therefore the attribute is put in the fact table.

What is the grain of a fact table?

According to Ralph Kimball, it is important to declare the grain of the fact table. Grain is the
smallest unit of occurrence of the business event in which the event is measured. In other words, grain is completing this sentence: “One row in the fact table corresponds to. . . .”


Ex: For a fact table storing product sales, the grain is each item sold: one row in the Product Sales fact table corresponds to each item sold.
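A sketch of a fact table at this grain, using sqlite3 with invented column names; at the "one row per item sold" grain, a single order of three items produces three fact rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Grain declaration: "one row in fact_product_sales corresponds to one
# item sold" -- so the table gets one row per line item, not per order.
cur.execute("""
CREATE TABLE fact_product_sales (
    date_key    INTEGER,
    product_key INTEGER,
    order_id    TEXT,     -- degenerate dimension
    quantity    INTEGER,
    amount      REAL
)""")

# One order containing three items yields three rows at this grain.
cur.executemany(
    "INSERT INTO fact_product_sales VALUES (?, ?, ?, ?, ?)",
    [(20100910, 1, "ORD-1", 1, 9.99),
     (20100910, 2, "ORD-1", 1, 4.50),
     (20100910, 3, "ORD-1", 1, 2.00)])

print(cur.execute("SELECT COUNT(*) FROM fact_product_sales").fetchone()[0])
```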

Friday, September 10, 2010

Supported operating systems for Informatica, DataStage, and Teradata

Informatica Powercenter 8.6
==========================

server :

Windows 2000 / 2003 / 2008
Sun Solaris
HP-UX
IBM AIX
Red Hat Linux
SUSE Linux

databases :

Domain:
IBM DB2 UDB
Microsoft SQL Server
Oracle
Sybase ASE

Repository:
IBM DB2 UDB
Microsoft SQL Server
Oracle
Sybase ASE


Client:

Windows 2000 / 2008 / XP
Windows Vista (PowerCenter Client only)

IBM InfoSphere 8.1
================================

Server:
AIX
AIX 5.3 and 6.1

HP-UX
HP-UX 11i v2 on PA-RISC
HP-UX 11i v3 on PA-RISC
HP-UX 11i v2 on Intel Itanium
HP-UX 11i v3 on Intel Itanium

Linux
Red Hat Enterprise Linux Advanced Server 4 on AMD or Intel processors
Red Hat Enterprise Linux 5 Advanced Platform on AMD or Intel processors
Red Hat Enterprise Linux 5 Advanced Platform on IBM System z

SUSE Linux Enterprise Server 10 on AMD or Intel processors
SUSE Linux Enterprise Server 10 on IBM System z
Solaris
Solaris 9 and Solaris 10

Windows:

Server: Windows Server 2003 Service Pack 2

Client:

Windows XP Service Pack 2, Windows Vista, Windows Server 2003 Service Pack 2

Teradata
=============================
Teradata Tools and Utilities :

IBM z/OS
UNIX and Linux
Microsoft Windows

Teradata Enterprise Data Warehouses are often accessed via ODBC, JDBC or via native support
by applications running on operating systems such as Microsoft Windows or flavors of UNIX.
The warehouse typically sources data from operational systems via a
combination of batch and trickle loads.


The enterprise edition runs on Teradata's proprietary servers.
Teradata offers a choice of two operating systems, as listed below:

* Microsoft Windows Server 2003
* SUSE Linux Enterprise Server on 64-bit Intel servers.