CPSC 343 Database Theory and Practice Fall 2017

# Homework 5: File Organization and Query Processing

## due Wed Dec 6 in class

### File Organization

The questions in this section make use of the following information:

Assume that a disk has block size B = 512 bytes. A block pointer is P = 6 bytes long.

The file has 50,000 fixed length EMPLOYEE records. Each EMPLOYEE record has the following fields: Name (30 bytes), Ssn (9 bytes), Department_code (9 bytes), Address (40 bytes), Phone (10 bytes), Birth_date (8 bytes), Sex (1 byte), Job_code (4 bytes), and Salary (4 bytes). Ssn is the primary key.

Show your work - organize your work clearly, and show the formula used for each calculation. (It should be clear where each part of your answer comes from.) No credit will be given for answers where it is not apparent where your answers come from, even if correct. If you need information that isn't provided, make a reasonable assumption and clearly state your assumption along with a brief rationale for why the assumption is reasonable.

1. Calculate the record size R in bytes.

2. Calculate the blocking factor bfr and the number of file blocks b, assuming an unspanned organization.

3. Suppose that the file is ordered by the key field Ssn and we want to construct a primary index on Ssn. Calculate:

1. the index blocking factor bfri
2. the number of first-level index entries and the number of first-level index blocks
3. the number of block accesses needed to search for and retrieve a record given its Ssn value using the (single-level) index
4. the number of levels needed if we make it into a multilevel index
5. the total number of blocks required by the multilevel index
6. the number of block accesses needed to search for and retrieve a record given its Ssn value using the multilevel index
4. Suppose the file is not ordered by the key field Ssn and we want to construct a secondary index on Ssn. Calculate:

1. the index blocking factor bfri
2. the number of first-level index entries and the number of first-level index blocks
3. the number of block accesses needed to search for and retrieve a record given its Ssn value using the (single-level) index
4. the number of levels needed if we make it into a multilevel index
5. the total number of blocks required by the multilevel index
6. the number of block accesses needed to search for and retrieve a record given its Ssn value using the multilevel index
5. Suppose that the file is ordered by the nonkey field Department_code and we want to construct a clustering index on Department_code. Assume there are 500 distinct values of Department_code and that the EMPLOYEE records are evenly distributed among these values. Calculate:

1. the index blocking factor bfri
2. the number of first-level index entries and the number of first-level index blocks
3. the number of block accesses needed to search for and retrieve all of the records with a particular Department_code using the (single-level) index
4. the number of levels needed if we make it into a multilevel index
5. the total number of blocks required by the multilevel index
6. the number of block accesses needed to search for and retrieve all of the records with a particular Department_code using the multilevel index
6. Suppose that the file is not ordered by the nonkey field Department_code and we want to construct a secondary index on Department_code using option 3 (one index record per distinct department code). Assume there are 500 distinct values of Department_code and that the EMPLOYEE records are evenly distributed among these values. Calculate:

1. the index blocking factor bfri
2. the number of first-level index entries and the number of first-level index blocks
3. the number of indirect blocks
4. the total number of blocks required by the first-level index, including indirect blocks
5. the number of block accesses needed to search for and retrieve all of the records with a particular Department_code using the (single-level) index
6. the number of levels needed if we make it into a multilevel index
7. the total number of blocks required by the multilevel index, including indirect blocks
8. the number of block accesses needed to search for and retrieve all of the records with a particular Department_code using the multilevel index
7. In light of your answers, is it worth creating an index in each of these situations? A multilevel index? Explain. Consider both search time and the extra space required to store the index.

### Query Processing

The questions in this section make use of the following:

```  STUDENT(name,id,gradyear,major,minor)
COURSE(dept,number,name,credithours)
SECTION(id,dept,number,term,instructor)
```

Key attributes are underlined. The foreign key constraints are:

• SECTION.dept and SECTION.number refer to COURSE.dept and COURSE.number, rspectively

Assume that a disk has block size B = 512 bytes. A block pointer is P = 6 bytes long. There are 4 blocks of memory available (nB).

STUDENT
number of rows12,500
data file sorted byid
column information
name46 bytes
id2 bytes
gradyear1 byteinteger values between 2015 and 2021
major4 bytes25 distinct values
minor4 bytes25 distinct values
COURSE
number of rows500
data file sorted bydept, number
column information
dept4 bytes25 distinct values
number2 bytesinteger values between 100 and 499
name61 bytes
credithours4 bytes8 distinct values
SECTION
number of rows10,000
data file sorted byid
column information
id3 bytes
dept4 bytes25 distinct values
coursenum2 bytesinteger values between 100 and 499
term3 bytes20 distinct values
instructor21 bytes200 distinct values
number of rows400,000
data file sorted bystudent, section
column information
student3 bytes12,500 distinct values
section3 bytes10,000 distinct values
(A+, A, A-, B+, B, B-, etc)
index# levels (x)# level 1 blocks (b1)

Unless otherwise indicated, assume that these indexes are the only ones available.

Show your work - organize your work clearly, and show the formula used for each calculation. (It should be clear where each part of your answer comes from.) No credit will be given for answers where it is not apparent where your answers come from, even if correct. If you need information that isn't provided, make a reasonable assumption and clearly state your assumption along with a brief rationale for why the assumption is reasonable.

1. Consider the following query:

```      SELECT S.name,S.gradyear,C.dept,C.number,S2.id,S2.term,G.grade
FROM STUDENT S, GRADES G, SECTION S2, COURSE C
WHERE S.id=G.student AND G.section=S2.id AND S2.dept=C.dept AND S2.number=C.number AND
C.dept='CPSC' AND C.credithours=4 AND G.grade LIKE 'A%'    ```
1. Draw the query tree corresponding most directly to the query as stated.

2. Apply heuristic optimization to your query tree from (a). Show your work - draw the query tree after each step of the heuristic optimization algorithm, label which step of the algorithm resulted in each tree, and provide a rationale for your actions in step 3.
2. Consider the following query tree:

1. Identify which of SL, SB, SH, SP, SC, and/or SS are applicable for the SELECT and which of JNL, JSL, and/or JSM are applicable for the JOIN in this query tree. Also provide a rationale for your answers.

2. Where there's a choice of algorithm, what choices do you expect to produce the best execution plan? Provide a rationale for your choices.

3. State the cost of your execution plan from (b) in terms of the number of blocks read/written, both with and without pipelining.

1. Apply heuristic query optimization to the query tree from the previous problem to get a different (but equivalent) query tree. Show your work - draw the query tree after each step of the heuristic optimization algorithm, label which step of the algorithm resulted in each tree, and provide a rationale for your actions in step 3.

2. Identify which of SL, SB, SH, SP, SC, and/or SS are applicable for the SELECT and which of JNL, JSL, and/or JSM are applicable for the JOIN in this query tree. Also provide a rationale for your answers.

3. Where there's a choice of algorithm, what choices do you expect to produce the best execution plan? Provide a rationale for your choices.

4. State the cost of your execution plan from (c) in terms of the number of blocks read/written, both with and without pipelining.

3. Suppose that an index is available on SECTION.coursenum. Could this index be used to develop a better execution plan for the query tree from #9? What about the optimized query tree from #10? If so, explain how the index is helpful; if not, explain why not.

 last updated: --Wed Nov 29 13:29:18 EST 2017-- page owned by: bridgeman@hws.edu