Wednesday, December 22, 2010

Google's AROUND Operator for Proximity Search

via Google Operating System by Alex Chitu on 12/8/10

Google has an undocumented operator called "AROUND" for finding web pages that include words or phrases that are near each other.

If you want to find results that include both "Steve Jobs" and "Andy Rubin", you might search for ["Steve Jobs" "Andy Rubin"] or even for ["Steve Jobs * Andy Rubin"]. Google's AROUND operator lets you specify the maximum number of words that separate the two names. For example, you could search for ["Steve Jobs" AROUND(3) "Andy Rubin"] and only get web pages that include the two names separated by at most three words.


"The AROUND operator is a handy trick to use when you're looking for a combination of search terms when one dominates the results, but you're interested in the relationship between two query terms. Note also that if Google can't find anything within the limit, it will just do regular ranking of the terms without the AROUND coming into play. Using AROUND is especially useful when the documents are rather long (think book-length articles). So try this operator in Google Books.... [slavery AROUND(4) indigo]," suggests Google's Daniel Russell.

Barry Schwartz notes that Bing has a similar operator, but it's called "near".

{ via Search Engine Roundtable }

Tuesday, December 21, 2010

Joining all lines in a file with comma

awk '{ printf "%s,", $0 }' file.txt > newfile.txt

(Note: this leaves a trailing comma after the last line.)

Monday, December 20, 2010

MySQL db count tables

select count(*) as number_of_tables 
from information_schema.tables 
where table_schema = 'my favourite schema' 


Saturday, December 11, 2010

Find the minimum difference between two arrays

Given two sorted arrays, A and B, find i,j for which |A[i] - B[j]| is minimum
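
A two-pointer sketch of one solution (assuming both arrays are sorted ascending):

<?php
// Walk both arrays at once, always advancing the pointer at the smaller
// value: moving the larger one could only widen the gap.
// O(|A| + |B|) time, O(1) extra space.
function minDiffPair(array $a, array $b) {
    $i = 0; $j = 0;
    $best = PHP_INT_MAX;
    $pair = null;
    while ($i < count($a) && $j < count($b)) {
        $diff = abs($a[$i] - $b[$j]);
        if ($diff < $best) {
            $best = $diff;
            $pair = array($i, $j);
        }
        if ($a[$i] < $b[$j]) $i++; else $j++;
    }
    return $pair; // indices (i, j) minimizing |A[i] - B[j]|
}

print_r(minDiffPair(array(1, 5, 9), array(6, 11, 20))); // (1, 0): |5 - 6| = 1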

Longest sequence of d-dimensional boxes fitting into each other


Interview question:
You have n d-dimensional boxes.
You have to find the longest sequence
of boxes which fit into each other.

Hints:
1. How do you decide whether a 3-d box
fits into another?
2. Extend that test to d dimensions.
3. It's possible to do a BFS on a directed acyclic graph
without keeping a 'visited' lookup table, since it will
never go into an infinite loop.
4. Finding the longest path in a general graph is NP-hard,
but on a DAG it reduces to simple dynamic programming
(see the sketch below).
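
Putting the hints together, a PHP sketch (function names are illustrative):

<?php
// Box A fits inside box B iff, with both dimension lists sorted,
// every dimension of A is strictly smaller than the matching one of B.
function fitsInto(array $a, array $b) {
    sort($a); sort($b); // arrays are passed by value, so copies are sorted
    for ($k = 0; $k < count($a); $k++) {
        if ($a[$k] >= $b[$k]) return false;
    }
    return true;
}

// Longest chain: order boxes by total size (a box can only fit into a
// strictly "bigger" one), then run LIS-style DP over that order.
function longestChain(array $boxes) {
    usort($boxes, function ($a, $b) { return array_sum($a) - array_sum($b); });
    $n = count($boxes);
    $len = array_fill(0, $n, 1); // best chain ending at box i
    $best = $n > 0 ? 1 : 0;
    for ($i = 0; $i < $n; $i++) {
        for ($j = 0; $j < $i; $j++) {
            if (fitsInto($boxes[$j], $boxes[$i])) {
                $len[$i] = max($len[$i], $len[$j] + 1);
            }
        }
        $best = max($best, $len[$i]);
    }
    return $best; // O(n^2 * d log d) overall
}

echo longestChain(array(array(1,2,3), array(2,3,4), array(5,1,1), array(3,4,5))); // 3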

Encoding/decoding a tree


How do you encode a tree, send it over
the wire, and decode it back?
Give algorithms for encoding and decoding.
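
One common scheme, sketched in PHP for a binary tree: preorder with an
explicit '#' marker for empty subtrees, which makes the stream uniquely
decodable (node layout follows the BST post below).

<?php
function encodeTree($node, array &$out) {
    if ($node === null) { $out[] = '#'; return; } // mark empty subtree
    $out[] = $node['data'];
    encodeTree(isset($node['l']) ? $node['l'] : null, $out);
    encodeTree(isset($node['r']) ? $node['r'] : null, $out);
}

function decodeTree(array $tokens, &$pos) {
    if ($tokens[$pos] === '#') { $pos++; return null; }
    $node = array('data' => $tokens[$pos++]);
    $node['l'] = decodeTree($tokens, $pos); // preorder: left subtree comes next
    $node['r'] = decodeTree($tokens, $pos);
    return $node;
}

// Wire format: just join the tokens, e.g. "22,20,#,#,24,#,#".
$out = array();
encodeTree(array('data' => 22, 'l' => array('data' => 20), 'r' => array('data' => 24)), $out);
$wire = implode(',', $out);
$pos = 0;
$copy = decodeTree(explode(',', $wire), $pos);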

Location of an element where index is same as the value


In a sorted integer array,
find an index for which the
value is the same as the index, i.e. a[i] == i.

E.g., in a = {-2,0,2,3},
a[2] is 2, so the answer is 2.

In {1,2,5,7}, there is no such index.

My answer is along the lines of binary search,
but I would also like to see other answers.
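
The binary-search version, assuming the integers are distinct (then
a[i] - i is non-decreasing, so the fixed point can be bisected; with
duplicates this breaks down):

<?php
function fixedPoint(array $a) {
    $lo = 0; $hi = count($a) - 1;
    while ($lo <= $hi) {
        $mid = intval(($lo + $hi) / 2);
        if ($a[$mid] == $mid) return $mid;
        if ($a[$mid] < $mid) $lo = $mid + 1; // a[i] - i < 0 at mid and left of it
        else $hi = $mid - 1;                 // a[i] - i > 0 at mid and right of it
    }
    return -1; // no such index
}

echo fixedPoint(array(-2, 0, 2, 3)); // 2
echo fixedPoint(array(1, 2, 5, 7));  // -1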

Locker design interview question


You have to design a locker system where people
can deposit their bags of 3 types: small, medium, big.
And there are 3 types of lockers: small, medium, big.
Let's say there are n lockers of each type.

Also, a small bag goes in a small locker,
a big bag in a big locker, and so on.

You need to support two operations:
1. deposit a bag
2. retrieve a bag (when the depositor comes back)
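
One possible O(1) design, sketched in PHP (class and method names are
illustrative): a free-list of locker numbers per size, plus a map from
issued tickets to lockers.

<?php
// Free lockers are stacks of ids, one stack per size; a deposit pops a
// free locker and files it under a new ticket, a retrieve looks the
// ticket up and pushes the locker back. Both operations are O(1).
class LockerSystem {
    private $free = array();    // size => stack of free locker ids
    private $tickets = array(); // ticket => array(size, locker id)
    private $nextTicket = 1;

    public function __construct($n) {
        foreach (array('small', 'medium', 'big') as $size) {
            $this->free[$size] = range(1, $n);
        }
    }

    // Returns a ticket, or null if no locker of that size is free.
    public function deposit($size) {
        if (empty($this->free[$size])) return null;
        $locker = array_pop($this->free[$size]);
        $ticket = $this->nextTicket++;
        $this->tickets[$ticket] = array($size, $locker);
        return $ticket;
    }

    // Frees the locker for the given ticket and returns its id.
    public function retrieve($ticket) {
        if (!isset($this->tickets[$ticket])) return null;
        list($size, $locker) = $this->tickets[$ticket];
        unset($this->tickets[$ticket]);
        $this->free[$size][] = $locker;
        return $locker;
    }
}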

interview questions

find the height of a binary tree, iteratively
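
A level-order sketch in PHP, reusing the array-based nodes from the BST
post below:

<?php
// Count levels with a queue sweep: drain one whole level per iteration.
function treeHeight($root) {
    if ($root === null) return 0;
    $queue = array($root);
    $height = 0;
    while (count($queue) > 0) {
        $height++;
        $next = array();
        foreach ($queue as $node) {
            if (isset($node['l'])) $next[] = $node['l'];
            if (isset($node['r'])) $next[] = $node['r'];
        }
        $queue = $next; // descend to the next level
    }
    return $height; // a single node has height 1 by this convention
}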

Tuesday, December 7, 2010

Interview question

I have 6 million coordinates of the type (x,y).
I know the bounding rectangle for those points,
that is, I know min(x), min(y), max(x), max(y)
over the complete range.

Now, write a function closestTen(x,y) which
will take an (x,y) point as input and find its 10 closest
neighbors in the given bounding rectangle.

This input (x,y) can be inside the bounding rectangle
or outside it. It may also be one of the 6 million points,
but not necessarily.
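
A baseline sketch in PHP: one linear scan, keeping the 10 best candidates
seen so far. (For many queries the better answer is a spatial index, e.g.
a uniform grid or k-d tree over the bounding rectangle; that is only
noted here, not implemented.)

<?php
function closestTen(array $points, $x, $y) {
    $best = array(); // up to 10 entries of array(squared distance, point)
    foreach ($points as $p) {
        $d2 = ($p[0] - $x) * ($p[0] - $x) + ($p[1] - $y) * ($p[1] - $y);
        if (count($best) < 10) {
            $best[] = array($d2, $p);
            continue;
        }
        $worst = 0; // index of the farthest of the 10 kept points
        foreach ($best as $k => $entry) {
            if ($entry[0] > $best[$worst][0]) $worst = $k;
        }
        if ($d2 < $best[$worst][0]) $best[$worst] = array($d2, $p); // replace it
    }
    usort($best, function ($a, $b) { return $a[0] < $b[0] ? -1 : 1; });
    return array_map(function ($e) { return $e[1]; }, $best); // nearest first
}

$pts = array(array(0, 0), array(1, 1), array(5, 5), array(2, 2));
print_r(closestTen($pts, 0, 0));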

Some Interview questions

1. Given a sorted array which has distinct integers in ascending order,
find an index where a[k] = k;
Solution : modified binary search

2. Give me the data structure and algorithm for a locker system
where there are 3 types of lockers: big, medium and small. The number of
lockers is n for every type.
Customers come with 3 types of bags: big, medium and small.
A small bag goes in a small locker, and so on...

You have to support two operations :
a. insert bag,
b. retrieve bag.

For both of the operations, the expected time complexity is O(1).

3. Given a binary search tree, find two elements which sum up to k.
Solution: convert it into a doubly linked list and then solve the problem
as if you were solving it for a sorted array (see the sketch below).
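
A sketch of solution 3 in PHP, flattening into a plain array instead of a
doubly linked list (same two-pointer idea, but O(n) rather than O(lg n)
extra space; node layout follows the BST post below):

<?php
// In-order traversal of a BST yields the values in sorted order.
function flattenInorder($node, array &$out) {
    if ($node === null) return;
    flattenInorder(isset($node['l']) ? $node['l'] : null, $out);
    $out[] = $node['data'];
    flattenInorder(isset($node['r']) ? $node['r'] : null, $out);
}

function twoSumBst($root, $k) {
    $vals = array();
    flattenInorder($root, $vals);
    $i = 0; $j = count($vals) - 1;
    while ($i < $j) {
        $sum = $vals[$i] + $vals[$j];
        if ($sum == $k) return array($vals[$i], $vals[$j]);
        if ($sum < $k) $i++; else $j--; // classic sorted-array two-pointer step
    }
    return null; // no pair sums to k
}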

Monday, December 6, 2010

php echo print difference

echo is marginally faster, since it doesn't return anything, as opposed
to print, which always returns 1 (and so can be used inside expressions);
echo also accepts multiple comma-separated arguments.
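
A quick illustration of the difference:

<?php
// print is an expression that always returns 1, so it can be nested:
$r = print "hello\n";   // $r is 1

// echo returns nothing, but takes multiple comma-separated arguments:
echo "a", "b", "c", "\n";
// print "a", "b"; would be a parse error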

PHP how to get the browser information

Through $_SERVER[], via the HTTP_USER_AGENT entry;
this may be coupled with the get_browser() function.
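
For example (note that get_browser() relies on the browscap ini setting
pointing at a browscap.ini database):

<?php
// The raw user-agent string the browser sent:
echo $_SERVER['HTTP_USER_AGENT'];

// get_browser() parses that string against browscap.ini;
// passing true returns an array instead of an object.
$info = get_browser(null, true);
echo $info['browser'], ' ', $info['version'];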

GET POST difference

1. POST is a superset of GET, in the sense that a POST request can carry data
in the request body as well as the query string,
whereas GET carries data only in the query string.
2. Results of GET can be cached, not POST.
3. Sending passwords by GET can expose your password to:
(a) a casual observer looking over your shoulder
(b) someone reading server logs
(c) referrer information leaked to other sites
4. Through POST, you can send more data.
5. Resubmitting a POST request prompts (or should prompt) the user.
6. GET/query-string requests are really good for information required
for bookmarking a particular item, or for assisting in search
engine optimization and indexing items.
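
Point 1 illustrated in PHP (parameter names are made up):

<?php
// A POST request can still carry a query string in its URL.
// POSTing a form to /handler.php?src=email populates both arrays:
$src  = isset($_GET['src'])   ? $_GET['src']   : null; // from the URL
$name = isset($_POST['name']) ? $_POST['name'] : null; // from the request body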

Saturday, December 4, 2010

BST in PHP

<pre>
<?php
// Nodes are nested associative arrays: 'data' holds the value,
// 'l' and 'r' hold the left and right subtrees.

$root['data'] = 22;
$root = tree_insert($root,20);
$root = tree_insert($root,24);
$root = tree_insert($root,23);
$root = tree_insert($root,23.5);

$root = tree_insert($root,19);
$root = tree_insert($root,21);
$root = tree_insert($root,20.5);
$root = tree_insert($root,21.5);

$root = tree_insert($root,20.25);
$root = tree_insert($root,20.75);
$root = tree_insert($root,18);
$root = tree_insert($root,19.5);
$root = tree_insert($root,18.5);

print_r($root);

// Insert $data into the subtree rooted at $root and return the
// (possibly new) root; larger values go right, others go left.
function tree_insert($root,$data) {
    if($root) {
        if($root['data'] < $data) {
            $root['r'] = tree_insert(isset($root['r']) ? $root['r'] : null, $data);
        } else {
            $root['l'] = tree_insert(isset($root['l']) ? $root['l'] : null, $data);
        }
    } else {
        $root['data'] = $data;
    }
    return $root;
}

?>
</pre>

Web Security issues

1. Cross Site Request Forgery (CSRF)

Types:
A.
Change the DNS entry for the server in the user's ADSL router for a particular domain,
so that the request goes to a different domain altogether.

B.
E.g. a URL in one of the forums which Bob uses:
<img src="http://bank.example.com/withdraw?account=bob&amount=1000000&for=mallory">
If Bob's cookies haven't expired, this will benefit Mallory.

Prevention:
1. Add an authentication token to GET/POST requests
2. Check the Referer header
3. Use a per-session secret token (see the sketch below)
4. Restrict crossdomain.xml (which governs cross-domain requests from Flash)
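
A minimal sketch of the token defense (names are illustrative):

<?php
session_start();

// When rendering the form, mint a secret token tied to the session:
if (!isset($_SESSION['csrf_token'])) {
    $_SESSION['csrf_token'] = md5(uniqid(mt_rand(), true));
}
echo '<input type="hidden" name="csrf_token" value="'
   . $_SESSION['csrf_token'] . '">';

// When handling the POST, reject mismatching tokens; a forged cross-site
// request cannot read the victim's token, so it cannot supply it.
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    if (!isset($_POST['csrf_token'])
        || $_POST['csrf_token'] !== $_SESSION['csrf_token']) {
        header('HTTP/1.1 403 Forbidden');
        exit('invalid CSRF token');
    }
}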


2. Replay attack:
Eavesdrop on a client-server communication and replay it.
Prevention:
The server should send the client a one-time-use token (nonce).

Other issues:

3. Cross-site cooking / cross-subdomain cooking (allowing a web site to set cookies for other domains/subdomains):
if the affected person then logs on, the attacker can use that cookie.
General solutions:
generate a new session id before anything crucial,
destroy the session on a malicious referrer,
check browser and similar information,
time out old session ids.

4. A billion laughs:
(each entity doubles the previous one, so the single &ha128; reference expands to 2^127 copies of "Ha! ", exhausting a naive XML parser's memory)

<!DOCTYPE root [
<!ENTITY ha "Ha! ">
<!ENTITY ha2 "&ha;&ha;">
<!ENTITY ha3 "&ha2;&ha2;">
<!ENTITY ha4 "&ha3;&ha3;">
<!ENTITY ha5 "&ha4;&ha4;">
...
<!ENTITY ha128 "&ha127;&ha127;">
]>
<root>&ha128;</root>

5. Similar to a Billion laughs : 
ReDoS (Regular expression Denial of Service)

Cross site scripting attack (Google example)


The script (http://www.google.com/url?q=...) is normally used for redirecting the browser from Google's website to other sites.

For example, the following request will redirect the browser to http://www.watchfire.com:
http://www.google.com/url?q=http://www.watchfire.com 

When the parameter (q) is passed to the script in an illegal format (the format seems to be: http://domain), a "403 Forbidden" page is returned to the user, informing that the query was illegal. The parameter's value appears in the HTML returned to the user.

If http://www.google.com/url?q=USER_INPUT is requested, the text in the "403 Forbidden" response would be:
"Your client does not have permission to get URL /url?q=USER_INPUT from this server."

The server response lacks charset encoding enforcement, such as:
* Response headers: "Content-Type: text/html; charset=[encoding]".
* Response body: "<meta http-equiv="Content-Type" (...) charset=[encoding]/>".

Google's 404 NOT FOUND mechanism:
When requesting a page which doesn't exist under www.google.com, a 404 NOT FOUND response is returned to the user, with the original path requested.

If http://www.google.com/NOTFOUND is requested, the following text appears in the response:
"Not Found The requested URL /NOTFOUND was not found on this server."

The server response lacks charset encoding enforcement, such as:
* Response headers: "Content-Type: text/html; charset=[encoding]".
* Response body: "<meta http-equiv="Content-Type" (...) charset=[encoding]/>".

XSS vulnerabilities:
While the aforementioned mechanisms (URL redirection script, 404 NOT FOUND) escape common characters used for XSS, such as <> (angle brackets) and apostrophes, they fail to handle hazardous UTF-7 encoded payloads.

Therefore, when sending an XSS attack payload, encoded in UTF-7, the payload will return in the response without being altered.

For the attack to succeed (script execution), the victim's browser should treat the XSS payload as UTF-7.

IE charset encoding Auto-Selection:
If 'Encoding' is set to 'Auto-Select', and Internet-Explorer finds a UTF-7 string in the first 4096 characters of the response's body, it will set the charset encoding to UTF-7 automatically, unless a certain charset encoding is already enforced.

This automatic encoding selection feature makes it possible to mount UTF-7 XSS attacks on Google.com.

Solution:
Google solved the aforementioned issues on 01/12/2005 by using character encoding enforcement.
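
That class of fix, sketched in PHP:

<?php
// Declare the charset explicitly, so the browser never has to guess
// (and so IE's auto-detection can't pick UTF-7):
header('Content-Type: text/html; charset=UTF-8');
echo '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>';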


Friday, December 3, 2010

Some db performance optimizations

1. Materialized views:
views where the results are cached and refreshed from time to time.

2. Save the results of aggregate operations and refresh them periodically (or use triggers).
(E.g. stackoverflow updates your accept rate not immediately, but after some time.)

3. Vertically split tables (by columns), keeping the primary key column in both partitions.
If the original table is still being maintained, there is no need to carry over the RI constraints.

4. Redundant data:
table1(empid,empname,deptId) table2(deptId,deptName)
If empname and deptName are usually fetched together, consider
copying deptName into the first table.

5. While dealing with repeating groups, you can consider having multiple columns
rather than multiple rows in order to speed up access.
But before doing this, know that:
-> it will be difficult to aggregate the data
-> it only pays off if the data is accessed collectively


Data Denormalization guidelines

this page is mirrored from The Data Administration Newsletter (www.tdan.com). Please visit them for more.

DENORMALIZATION GUIDELINES
Craig S. Mullins, PLATINUM technology, inc.

Normalization is the process of putting one fact in one appropriate place. This optimizes updates at the expense of retrievals. When a fact is stored in only one place, retrieving many different but related facts usually requires going to many different places. This tends to slow the retrieval process. Updating is quicker, however, because the fact you're updating exists in only one place.

It is generally recognized that all relational database designs should be based on a normalized logical data model. With a normalized data model, one fact is stored in one place, related facts about a single entity are stored together, and every column of each entity refers non-transitively to only the unique identifier for that entity. Although an in-depth discussion of normalization is beyond the scope of this article, brief definitions of the first three normal forms follow:

  • In first normal form, all entities must have a unique identifier, or key, that can be composed of one or more attributes. In addition, all attributes must be atomic and non-repeating. (Atomic means that the attribute must not be composed of multiple attributes. For example, EMPNO should not be composed of social security number and last name because these are separate attributes.)
  • In second normal form, all attributes that are not part of the key must depend on the entire key for that entity.
  • In third normal form, all attributes that are not part of the key must not depend on any other non-key attributes.

Frequently, however, performance needs dictate very quick retrieval capability for data stored in relational databases. To accomplish this, sometimes the decision is made to denormalize the physical implementation. Denormalization is the process of putting one fact in numerous places. This speeds data retrieval at the expense of data modification.

It is not the intention of this article to promote the concept of denormalization. Of course, a normalized set of relational tables is the optimal environment and should be implemented whenever possible. Yet, in the real world, denormalization is sometimes necessary. Denormalization is not necessarily a bad decision if implemented wisely. You should always consider these issues before denormalizing:

  • can the system achieve acceptable performance without denormalizing?
  • will the performance of the system after denormalizing still be unacceptable?
  • will the system be less reliable due to denormalization?

If the answer to any of these questions is "yes," then you should avoid denormalization because any benefit that is accrued will not exceed the cost. If, after considering these issues, you decide to denormalize, be sure to adhere to the general guidelines that follow.

If enough DASD is available at your shop, create two sets of tables: one set fully normalized and another denormalized. Populate the denormalized versions by querying the data in the normalized tables and loading or inserting it into the denormalized tables. Your application can access the denormalized tables in a read-only fashion and achieve performance gains. It is imperative that a controlled and scheduled population function is maintained to keep the data in the denormalized and normalized tables synchronized.

If DASD is not available for two sets of tables, then maintain the denormalized tables programmatically. Be sure to update each denormalized table representing the same entity at the same time, or alternately, to provide a rigorous schedule whereby tables will be synchronized. At any rate, all users should be informed of the implications of inconsistent data if it is deemed impossible to avoid unsynchronized data.

When updating any column that is replicated in many different tables, always update it everywhere that it exists simultaneously, or as close to simultaneously as possible given the physical constraints of your environment. If the denormalized tables are ever out of sync with the normalized tables be sure to inform end-users that batch reports and on-line queries may not contain sound data; if at all possible, this should be avoided.

Finally, be sure to design the application so that it can be easily converted from using denormalized tables to using normalized tables.

The Reason for Denormalization

Only one valid reason exists for denormalizing a relational design - to enhance performance. However, there are several indicators which will help to identify systems and tables which are potential denormalization candidates. These are:

  • Many critical queries and reports exist which rely upon data from more than one table. Oftentimes these requests need to be processed in an on-line environment.
  • Repeating groups exist which need to be processed in a group instead of individually.
  • Many calculations need to be applied to one or many columns before queries can be successfully answered.
  • Tables need to be accessed in different ways by different users during the same timeframe.
  • Many large primary keys exist which are clumsy to query and consume a large amount of DASD when carried as foreign key columns in related tables.
  • Certain columns are queried a large percentage of the time. Consider 60% or greater to be a cautionary number flagging denormalization as an option.

Be aware that each new RDBMS release usually brings enhanced performance and improved access options that may reduce the need for denormalization. However, most of the popular RDBMS products on occasion will require denormalized data structures. There are many different types of denormalized tables which can resolve the performance problems caused when accessing fully normalized data. The following topics will detail the different types and give advice on when to implement each of the denormalization types.

Pre-Joined Tables

If two or more tables need to be joined on a regular basis by an application, but the cost of the join is prohibitive, consider creating tables of pre-joined data. The pre-joined tables should:

  • contain no redundant columns (matching join criteria columns)
  • contain only those columns absolutely necessary for the application to meet its processing needs
  • be created periodically using SQL to join the normalized tables

The cost of the join will be incurred only once when the pre-joined tables are created. A pre-joined table can be queried very efficiently because every new query does not incur the overhead of the table join process.

Report Tables

Oftentimes it is impossible to develop an end-user report using SQL or QMF alone. These types of reports require special formatting or data manipulation. If certain critical or highly visible reports of this nature are required to be viewed in an on-line environment, consider creating a table that represents the report. This table can then be queried using SQL, QMF, and/or another report facility. The report should be created using the appropriate mechanism (application program, 4GL, SQL, etc.) in a batch environment. It can then be loaded into the report table in sequence. The report table should:

  • contain one column for every column of the report
  • have a clustering index on the columns that provide the reporting sequence
  • not subvert relational tenets (such as, 1NF and atomic data elements)

Report tables are ideal for carrying the results of outer joins or other complex SQL statements. If an outer join is executed and then loaded into a table, a simple SELECT statement can be used to retrieve the results of the outer join, instead of the complex UNION technique shown in Figure 1. Some RDBMS products support an explicit outer join function which can be used instead of the UNION depicted. However, depending on the implementation, the explicit outer join may be simpler or more complex than the UNION it replaces.

Figure 1. Outer Join Technique Using UNION

Mirror Tables

If an application system is very active it may be necessary to split processing into two (or more) distinct components. This requires the creation of duplicate, or mirror tables. Consider an application system that has very heavy on-line traffic during the morning and early afternoon hours. This traffic consists of both querying and updating of data. Decision support processing is also performed on the same application tables during the afternoon. The production work in the afternoon always seems to disrupt the decision support processing causing frequent time outs and dead locks.

This situation could be corrected by creating mirror tables. A foreground set of tables would exist for the production traffic and a background set of tables would exist for the decision support reporting. A mechanism to periodically migrate the foreground data to background tables must be established to keep the application data synchronized. One such mechanism could be a batch job executing UNLOAD and LOAD utilities. This should be done as often as necessary to sustain the effectiveness of the decision support processing.

It is important to note that since the access needs of decision support are often considerably different than the access needs of the production environment, different data definition decisions such as indexing and clustering may be chosen for the mirror tables.

Split Tables

If separate pieces of one normalized table are accessed by different and distinct groups of users or applications then consider splitting the table into two (or more) denormalized tables; one for each distinct processing group. The original table can also be maintained if other applications exist that access the entire table. In this scenario the split tables should be handled as a special case of mirror table. If an additional table is not desired then a view joining the tables could be provided instead.

Tables can be split in one of two ways: vertically or horizontally. Refer to Figure 2. A vertical split cuts a table column-wise, such that one group of columns is placed into one new table and the remaining columns are placed in another new table. A horizontally split table is a row-wise split. To split a table horizontally, rows are classified into groups via key ranges. The rows from one key range are placed in one table, those from another key range are placed in a different table, and so on.

Vertically split tables should be created by placing the primary key columns for the old, normalized table into both of the split tables. Designate one of the two, new tables as the parent table for the purposes of referential integrity unless the original table still exists. In this case, the original table should be the parent table in all referential constraints. If this is the case, and the split tables are read only, do not set up referential integrity (RI) for the split tables as they are being derived from a referentially intact source. RI would be redundant.

When a vertical split is being done, always include one row per primary key in each split table. Do not eliminate rows from either of the two tables for any reason. If rows are eliminated the update process and any retrieval process that must access data from both tables will be unnecessarily complicated.

When a horizontal split is being done, try to split the rows between the new tables to avoid duplicating any one row in each new table. This is done by splitting using the primary key such that discrete key ranges are placed in separate split tables. Simply stated, the operation of UNION ALL, when applied to the horizontally split tables, should not add more rows than contained in the original, un-split tables. Likewise, it should not contain fewer rows either.

Combined Tables

If tables exist with a one-to-one relationship consider combining them into a single combined table. Sometimes, even one-to-many relationships can be combined into a single table, but the data update process will be significantly complicated because of the increase in redundant data.

For example, consider an application with two tables: DEPT (containing department data) and EMP (containing employee data). Combining the two tables into a large table named, for example, EMP_WITH_DEPT. This new table would contain all of the columns of both tables except for the redundant DEPTNO column (the join criteria). So, in addition to all of the employee information, all of the department information would also be contained on each employee row. This will result in many duplicate instances of the department data. Combined tables of this sort should be considered pre-joined tables and treated accordingly. Tables with one to one relationships should always be analyzed to determine if combination is useful.

Redundant Data

Sometimes one or more columns from one table are accessed whenever data from another table is accessed. If these columns are accessed frequently with tables other than those in which they were initially defined, consider carrying them in those other tables as redundant data. By carrying these additional columns, joins can be eliminated and the speed of data retrieval will be enhanced. This should only be attempted if the normal access is debilitating.

Consider, once again, the DEPT and EMP tables. If most of the employee queries require the name of the employee's department then the department name column could be carried as redundant data in the EMP table. The column should not be removed from the DEPT table, though (causing additional update requirements if the department name changes).

In all cases columns that can potentially be carried as redundant data should be characterized by the following attributes:

  • only a few columns are necessary to support the redundancy
  • the columns should be stable, being updated only infrequently
  • the columns should be used by either a large number of users or a few very important users

Repeating Groups

When repeating groups are normalized they are implemented as distinct rows instead of distinct columns. This usually results in higher DASD usage and less efficient retrieval because there are more rows in the table and more rows need to be read in order to satisfy queries that access the repeating group.

Sometimes, denormalizing the data by storing it in distinct columns can achieve significant performance gains. However, these gains come at the expense of flexibility. For example, consider an application that is storing repeating group information in the normalized table below:

This table can store an infinite number of balances per customer, limited only by available storage and the storage limits of the RDBMS. If the decision were made to string the repeating group, BALANCE, out into columns instead of rows, a limit would need to be set for the number of balances to be carried in each row. An example of this after denormalization is shown below:

In this example, only six balances may be stored for any one customer. The number six is not important, but the concept that the number of values is limited is important. This reduces the flexibility of data storage and should be avoided unless performance needs dictate otherwise.

Before deciding to implement repeating groups as columns instead of rows be sure that the following criteria are met:

  • the data is rarely or never aggregated, averaged, or compared within the row
  • the data occurs in a statistically well-behaved pattern
  • the data has a stable number of occurrences
  • the data is usually accessed collectively
  • the data has a predictable pattern of insertion and deletion

If any of the above criteria are not met, SQL SELECT statements may be difficult to code making the data less available due to inherently unsound data modeling practices. This should be avoided because, in general, data is denormalized only to make it more readily available.

Derivable Data

If the cost of deriving data using complicated formulae is prohibitive then consider storing the derived data in a column instead of calculating it. However, when the underlying values that comprise the calculated value change, it is imperative that the stored derived data also be changed otherwise inconsistent information could be reported. This will adversely impact the effectiveness and reliability of the database.

Sometimes it is not possible to immediately update derived data elements when the columns upon which they rely change. This can occur when the tables containing the derived elements are off-line or being operated upon by a utility. In this situation, time the update of the derived data such that it occurs immediately when the table is made available for update. Under no circumstances should outdated derived data be made available for reporting and inquiry purposes.

Hierarchies

A hierarchy is a structure that is easy to support using a relational database such as DB2, but is difficult to retrieve information from efficiently. For this reason, applications which rely upon hierarchies very often contain denormalized tables to speed data retrieval. Two examples of these types of systems are the classic Bill of Materials application and a Departmental Reporting system. A Bill of Materials application typically records information about parts assemblies in which one part is composed of other parts. A Department Reporting system typically records the departmental structure of an organization indicating which departments report to which other departments.

A very effective way to denormalize a hierarchy is to create what are called "speed" tables. Figure 3 depicts a department hierarchy for a given organization. The hierarchic tree is built such that the top most node is the entire corporation and each of the other nodes represents a department at various levels within the corporation. In our example department 123456 is the entire corporation. Departments 1234 and 56 report directly to 123456. Departments 12, 3, and 4 report directly to 1234 and indirectly to department 123456. And so on.

The table shown under the tree in Figure 3 is the classic relational implementation of a hierarchy. There are two department columns, one for the parent and one for the child. This is an accurately normalized version of this hierarchy containing everything that is represented in the diagram. The complete hierarchy can be rebuilt with the proper data retrieval instructions.

Figure 3. Classic Relational Implementation of a Department Hierarchy

Even though the implementation effectively records the entire hierarchy, building a query to report all of the departments under any other given department can be time-consuming to code and inefficient to process. Figure 4 shows a sample query that will return all of the departments that report to the corporate node 123456. However, this query can only be built if you know in advance the total number of possible levels the hierarchy can achieve. If there are n levels in the hierarchy then you will need n-1 UNIONs.

Figure 4. Querying the Departmental Hierarchy

A "speed" table can be built such as the one in Figure 5. This table depicts the parent department and all of the departments under it regardless of the level. Contrast this to the previous table which only recorded immediate children for each parent. A "speed" table also commonly contains other pertinent information that is needed by the given application. Typical information includes the level within the hierarchy for the given node, whether or not the given node of the hierarchy is a detail node (at the bottom of the tree), and, if ordering within level is important, the sequence of the nodes at the given level.

Figure 5. Speed Table Implementation of a Departmental Hierarchy

After the "speed" table has been built, speedy queries can be written against this implementation of a hierarchy. Figure 6 shows various informative queries that would have been very inefficient to execute against the classical relational hierarchy. These queries work for any number of levels between the top and bottom of the hierarchy.

A "speed" table is commonly built using a program written in COBOL or another high level language. SQL alone is usually either too inefficient to handle the creation of a "speed" table or impractical because the number of levels in the hierarchy is either unknown or constantly changing.

Figure 6. Querying the Speed Table
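
Not from the article: a small PHP sketch of how a program can derive
speed-table rows from the classic (parent, child) pairs. The hierarchy
literal below is illustrative, following the article's department numbering.

<?php
$children = array(
    '123456' => array('1234', '56'),
    '1234'   => array('12', '3', '4'),
);

// Depth-first walk; every ancestor on the current path contributes one
// (ancestor, descendant, level) row for the node being visited.
function speedRows($node, array $children, array &$rows, $path = array()) {
    foreach ($path as $depth => $ancestor) {
        $rows[] = array($ancestor, $node, count($path) - $depth); // level below ancestor
    }
    $kids = isset($children[$node]) ? $children[$node] : array();
    foreach ($kids as $kid) {
        speedRows($kid, $children, $rows, array_merge($path, array($node)));
    }
}

$rows = array();
speedRows('123456', $children, $rows);
// $rows now holds every ancestor-descendant pair with its level,
// ready to be bulk-loaded into the speed table.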

Types of Denormalization

We have discussed nine different types of denormalization. The table below will summarize the types of denormalization that are available with a short description of when this type of denormalization is useful.

Summary

The decision to denormalize should never be made lightly because it involves a lot of administrative dedication. This dedication takes the form of documenting the denormalization decisions, ensuring valid data, scheduling of data migration, and keeping end users informed about the state of the tables. In addition, there is one more category of administrative overhead: periodic analysis.

Whenever denormalized data exists for an application the data and environment should be periodically reviewed on an on-going basis. Hardware, software, and application requirements will evolve and change. This may alter the need for denormalization. To verify whether or not denormalization is still a valid decision ask the following questions:

Have the processing requirements changed for the application such that the join criteria, timing of reports, and/or transaction throughput no longer require denormalized data?

Did a new DBMS release change performance considerations? For example, did the introduction of a new join method undo the need for pre-joined tables?

Did a new hardware release change performance considerations? For example, does the upgrade to a new high-end processor reduce the amount of CPU such that denormalization is no longer necessary? Or did the addition of memory enable faster data access enabling data to be physically normalized?

In general, periodically test whether the extra cost related to processing with normalized tables justifies the benefit of denormalization. You should measure the following criteria:

  • I/O saved
  • CPU saved
  • complexity of update programming
  • cost of returning to a normalized design

It is important to remember that denormalization was initially implemented for performance reasons. If the environment changes it is only reasonable to re-evaluate the denormalization decision. Also, it is possible that, given a changing hardware and software environment, denormalized tables may be causing performance degradation instead of performance gains.

Simply stated, always monitor and periodically re-evaluate all denormalized applications.

Craig S. Mullins is Vice President of Marketing and Operations for the database tools division of PLATINUM technology, inc. He has extensive experience in all facets of database systems development, including developing and teaching database and data modeling classes, systems analysis and design, database and system administration, and data analysis. Craig has worked with DB2 since V1 and has experience in multiple roles including programmer, DBA, instructor, and analyst. His experience spans industries having worked in manufacturing, banking, utilities, commercial software development, consulting and as a computer industry analyst for the Gartner Group. Additionally, Craig authored the popular DB2 Developer's Guide which provides over 1100 pages of tips, techniques, and guidelines to optimize DB2 for MVS.

Craig is also a frequent contributor to computer industry publications having over five dozen articles published during the past few years. His articles have been published in magazines like Byte, DB2 Update, Database Programming & Design, DBMS, Data Management Review, Relational Database Journal, Enterprise Systems Journal, Mainframe Client/Server and others.

Craig graduated cum laude with a degree in Computer Science and Economics from the University of Pittsburgh.



Interview question

1. Find two numbers in a binary search tree
which add up to X.

Solution: convert it to a doubly linked list using recursion (O(lg n) space),
and use head/tail pointers like you would for a sorted array.


2. What are prepared statements in MySQL?

3. table (empname,designation,mgrId,empId)
find employee name, manager name

4. What if the load balancer fails, what is the back up strategy?

5. What happens when you hit google.com in the browser.


Generating all the possible permutations in ActionScript/Flex

<?xml version="1.0" encoding="utf-8"?>
<mx:Application xmlns:mx="http://www.adobe.com/2006/mxml" layout="absolute" creationComplete="main()">
<mx:Script>
<![CDATA[
    import mx.collections.ArrayCollection;

    private function main():void {
        var ac1:ArrayCollection = new ArrayCollection();
        ac1.addItem(1);
        ac1.addItem(2);
        ac1.addItem(3);
        ac1.addItem(4);
        var output:ArrayCollection = permute(ac1);
        for (var i:int = 0; i < output.length; ++i) {
            var tmpAc:ArrayCollection = output.getItemAt(i) as ArrayCollection;
            var singleLine:String = "";
            for (var j:int = 0; j < tmpAc.length; ++j) {
                singleLine += String(tmpAc[j]) + " ";
            }
            trace(singleLine);
        }
    }

    // Recursively permute: for each element, prepend it to every
    // permutation of the remaining elements.
    private function permute(inputListOfNumbers:ArrayCollection):ArrayCollection {
        var output:ArrayCollection = new ArrayCollection();
        if (inputListOfNumbers.length <= 1) {
            // base case: an empty or single-element list has one permutation, itself
            var single:ArrayCollection = new ArrayCollection();
            for (var k:int = 0; k < inputListOfNumbers.length; ++k) {
                single.addItem(inputListOfNumbers.getItemAt(k));
            }
            output.addItem(single);
            return output;
        }
        for (var i:int = 0; i < inputListOfNumbers.length; ++i) {
            var otherThanIth:ArrayCollection = getElementsOtherThanIth(inputListOfNumbers, i);
            var tmpRes:ArrayCollection =
                prependNumber(inputListOfNumbers.getItemAt(i) as int, permute(otherThanIth));
            for (var j:int = 0; j < tmpRes.length; ++j) {
                output.addItem(tmpRes.getItemAt(j));
            }
        }
        return output;
    }

    // Put i at the front of every permutation in ac1.
    private function prependNumber(i:int, ac1:ArrayCollection):ArrayCollection {
        for (var j:int = 0; j < ac1.length; ++j) {
            var tmpAc:ArrayCollection = ac1.getItemAt(j) as ArrayCollection;
            tmpAc.addItemAt(i, 0);
        }
        return ac1;
    }

    // All elements of ac1 except the one at index i.
    private function getElementsOtherThanIth(ac1:ArrayCollection, i:int):ArrayCollection {
        var output:ArrayCollection = new ArrayCollection();
        for (var j:int = 0; j < ac1.length; ++j) {
            if (i != j) {
                output.addItem(ac1.getItemAt(j));
            }
        }
        return output;
    }
]]>
</mx:Script>
</mx:Application>

Thursday, December 2, 2010

Dijkstra's algorithm in PHP


<html><body><pre>
<?php
// Single-source shortest paths over a weighted edge list $d[u][v] = w.
// $d[$startNode][v] doubles as the best known distance to v.
// Note: this is a stack-driven label-correcting relaxation rather than
// textbook Dijkstra (there is no priority queue); it terminates here
// because the example graph is acyclic.
$d['a']['b'] = 10;
$d['a']['c'] = 2;
$d['b']['d'] = 1;
$d['c']['d'] = 1;
$d['c']['e'] = 7;
$d['d']['e'] = 20;
$d['e']['f'] = 9;

$startNode = 'a';

$nodeStack = array();
array_push($nodeStack, $startNode);
while (count($nodeStack) > 0) {
    $curr = array_pop($nodeStack);
    if (!isset($d[$curr])) continue; // node with no outgoing edges
    // distance from the start to $curr (0 for the start node itself)
    $base = isset($d[$startNode][$curr]) ? $d[$startNode][$curr] : 0;
    foreach ($d[$curr] as $v3 => $d3) {
        array_push($nodeStack, $v3);
        $newDist = $base + $d3;
        // relax: keep the smaller of the known and newly found distance
        if (!isset($d[$startNode][$v3]) || $d[$startNode][$v3] > $newDist) {
            $d[$startNode][$v3] = $newDist;
        }
    }
}

print_r($d[$startNode]);
?>
</pre></body></html>

Yahoo Interview Questions

1. Find the combined median of two sorted arrays.
2. How do you implement T9 word suggestion/spell correction on mobile phones?
3. Given a lot of (x,y) points on the Cartesian plane, draw a straight line such that the
sum of (the length of the line) + (the perpendiculars dropped from the points onto the line) is minimum.
4. How do you search for an element in a 2-D array where every row and every column is sorted?
5. Detect the start of the loop in a linked list.
6. There is an array with 1 million numbers, containing only 0s, 1s and 2s. How do you sort it?
What if the elements are complex objects having a field "type" with values 0, 1, 2? (See the sketch below.)
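
A sketch for question 6 in PHP: the Dutch national flag partition, one
pass, in place (works the same on objects if you compare a type field):

<?php
function sort012(array $a) {
    $lo = 0; $mid = 0; $hi = count($a) - 1;
    while ($mid <= $hi) {
        if ($a[$mid] == 0) {            // grow the block of 0s on the left
            list($a[$lo], $a[$mid]) = array($a[$mid], $a[$lo]);
            $lo++; $mid++;
        } elseif ($a[$mid] == 2) {      // grow the block of 2s on the right
            list($a[$mid], $a[$hi]) = array($a[$hi], $a[$mid]);
            $hi--;                      // re-examine the value swapped in
        } else {
            $mid++;                     // 1s stay in the middle
        }
    }
    return $a;
}

print_r(sort012(array(2, 0, 1, 2, 0, 1))); // 0 0 1 1 2 2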

Binary Search C recursive and iterative

int arrayBinSearchIter(int* arr, int val, int len) {
    int min = 0;
    int max = len - 1;
    while (min <= max) {
        int curr = min + (max - min) / 2; /* avoids overflow of (min+max)/2 */
        if (arr[curr] == val) return curr;
        if (arr[curr] < val) min = curr + 1;
        else max = curr - 1;
    }
    return -1; /* not found */
}

int arrayBinSearchRec(int* arr, int val, int min, int max) {
    if (min > max) return -1; /* empty range: not found */
    int curr = min + (max - min) / 2;
    if (arr[curr] == val) return curr;
    if (arr[curr] < val) min = curr + 1;
    else max = curr - 1;
    return arrayBinSearchRec(arr, val, min, max);
}

Wednesday, December 1, 2010

Flex Interview Questions

1. What are synchronous/asynchronous errors and how do you handle them?
2. Is an HTTPService call synchronous or asynchronous?
3. What is the parent class of Image?
4. How do you add a Sprite to a Canvas?
5. What is the HTML wrapper?
6. What is the compiler argument for specifying the targeted Flash Player version?
7. How do you do pagination for DataGrids?
