Question
8.7 The following table consists of training data from an employee database. The data have been generalized. For example, “31...35” for age represents the age range of 31 to 35. For a given row entry, count represents the number of data tuples having the values for department, status, age, and salary given in that row.

| department | status | age | salary | count |
|---|---|---|---|---|
| sales | senior | 31...35 | 46K...50K | 30 |
| sales | junior | 26...30 | 26K...30K | 40 |
| sales | junior | 31...35 | 31K...35K | 40 |
| systems | junior | 21...25 | 46K...50K | 20 |
| systems | senior | 31...35 | 66K...70K | 5 |
| systems | junior | 26...30 | 46K...50K | 3 |
| systems | senior | 41...45 | 66K...70K | 3 |
| marketing | senior | 36...40 | 46K...50K | 10 |
| marketing | junior | 31...35 | 41K...45K | 4 |
| secretary | senior | 46...50 | 36K...40K | 4 |
| secretary | junior | 26...30 | 26K...30K | 6 |

Let status be the class label attribute.
(a) How would you modify the basic decision tree algorithm to take into consideration the count of each generalized data tuple (i.e., of each row entry)?
(b) Use your algorithm to construct a decision tree from the given data.
(c) Given a data tuple having the values “systems,” “26...30,” and “46–50K” for the attributes department, age, and salary, respectively, what would a naïve Bayesian classification of the status for the tuple be?
Solution
(a) To make the basic decision tree algorithm take the count of each generalized tuple into account, change how the attribute selection measure (information gain, Gini index, and so on) is computed. Rather than treating each row as a single instance, weight each row's contribution by its count: whenever class frequencies or attribute-value frequencies are needed, sum the counts of the matching rows instead of counting the rows. The same weighting applies when a leaf is labeled by majority vote.
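The weighting described in (a) can be sketched in a few lines of Python. This is only an illustration, not a prescribed implementation: the names `ROWS`, `class_weights`, and `weighted_entropy` are made up for this example, and the rows are copied from the table in the question.

```python
from collections import defaultdict
from math import log2

# Each row: (department, status, age, salary, count), copied from the question's table.
ROWS = [
    ("sales", "senior", "31...35", "46K...50K", 30),
    ("sales", "junior", "26...30", "26K...30K", 40),
    ("sales", "junior", "31...35", "31K...35K", 40),
    ("systems", "junior", "21...25", "46K...50K", 20),
    ("systems", "senior", "31...35", "66K...70K", 5),
    ("systems", "junior", "26...30", "46K...50K", 3),
    ("systems", "senior", "41...45", "66K...70K", 3),
    ("marketing", "senior", "36...40", "46K...50K", 10),
    ("marketing", "junior", "31...35", "41K...45K", 4),
    ("secretary", "senior", "46...50", "36K...40K", 4),
    ("secretary", "junior", "26...30", "26K...30K", 6),
]


def class_weights(rows):
    """Sum the count column per class label instead of counting rows."""
    totals = defaultdict(int)
    for _dept, status, _age, _salary, count in rows:
        totals[status] += count
    return dict(totals)


def weighted_entropy(rows):
    """Entropy (in bits) of the class label, each row weighted by its count."""
    totals = class_weights(rows)
    n = sum(totals.values())
    return -sum((c / n) * log2(c / n) for c in totals.values())


print(class_weights(ROWS))
print(round(weighted_entropy(ROWS), 3))
```

The only difference from the unweighted algorithm is the `totals[status] += count` line, which would be `+= 1` if every row stood for a single tuple.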
(b) To build the decision tree with the modified algorithm, follow these steps:
- Compute the initial entropy of the whole data set.
- For each attribute, compute the information gain using the counts.
- Select the attribute with the highest information gain as the root node.
- Split the data set on the values of the selected attribute.
- Repeat the process recursively on each subset, always using the counts when computing the attribute selection measure.
Working through the steps on the given data:
- Compute the initial entropy:
  - Total tuples: 165
  - Frequency of "senior": 52
  - Frequency of "junior": 113
  - Initial entropy: Info(D) = -(52/165) log2(52/165) - (113/165) log2(113/165) ≈ 0.899 bits
- Compute the information gain for each attribute, starting with "department":
  - "sales": 110 tuples (30 senior, 80 junior)
  - "systems": 31 tuples (8 senior, 23 junior)
  - "marketing": 14 tuples (10 senior, 4 junior)
  - "secretary": 10 tuples (4 senior, 6 junior)
  - Every partition is mixed. The count-weighted entropy after the split is about 0.850 bits, so Gain(department) ≈ 0.899 - 0.850 ≈ 0.049 bits.
- For the attribute "age":
  - "21...25": 20 tuples (0 senior, 20 junior)
  - "26...30": 49 tuples (0 senior, 49 junior)
  - "31...35": 79 tuples (35 senior, 44 junior)
  - "36...40": 10 tuples (10 senior, 0 junior)
  - "41...45": 3 tuples (3 senior, 0 junior)
  - "46...50": 4 tuples (4 senior, 0 junior)
  - Only the "31...35" partition is mixed (entropy about 0.991 bits). The count-weighted entropy after the split is (79/165)(0.991) ≈ 0.474 bits, so Gain(age) ≈ 0.425 bits.
- For the attribute "salary":
  - "26K...30K": 46 tuples (0 senior, 46 junior)
  - "31K...35K": 40 tuples (0 senior, 40 junior)
  - "36K...40K": 4 tuples (4 senior, 0 junior)
  - "41K...45K": 4 tuples (0 senior, 4 junior)
  - "46K...50K": 63 tuples (40 senior, 23 junior)
  - "66K...70K": 8 tuples (8 senior, 0 junior)
  - Only the "46K...50K" partition is mixed (entropy about 0.947 bits). The count-weighted entropy after the split is (63/165)(0.947) ≈ 0.362 bits, so Gain(salary) ≈ 0.538 bits.
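The per-attribute tallies above can be checked with a short script. The sketch below is one way to compute count-weighted information gain; the helper names (`entropy`, `info_gain`, `COLUMNS`) are illustrative and not part of the exercise.

```python
from collections import defaultdict
from math import log2

# Each row: (department, status, age, salary, count), copied from the question's table.
ROWS = [
    ("sales", "senior", "31...35", "46K...50K", 30),
    ("sales", "junior", "26...30", "26K...30K", 40),
    ("sales", "junior", "31...35", "31K...35K", 40),
    ("systems", "junior", "21...25", "46K...50K", 20),
    ("systems", "senior", "31...35", "66K...70K", 5),
    ("systems", "junior", "26...30", "46K...50K", 3),
    ("systems", "senior", "41...45", "66K...70K", 3),
    ("marketing", "senior", "36...40", "46K...50K", 10),
    ("marketing", "junior", "31...35", "41K...45K", 4),
    ("secretary", "senior", "46...50", "36K...40K", 4),
    ("secretary", "junior", "26...30", "26K...30K", 6),
]
COLUMNS = {"department": 0, "age": 2, "salary": 3}  # attribute name -> tuple index


def entropy(rows):
    """Class-label entropy in bits, each row weighted by its count."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[1]] += row[4]
    n = sum(totals.values())
    return -sum((c / n) * log2(c / n) for c in totals.values())


def info_gain(rows, col):
    """Entropy before the split minus the count-weighted entropy after it."""
    n = sum(row[4] for row in rows)
    parts = defaultdict(list)
    for row in rows:
        parts[row[col]].append(row)
    remainder = sum(
        sum(row[4] for row in part) / n * entropy(part) for part in parts.values()
    )
    return entropy(rows) - remainder


gains = {name: info_gain(ROWS, col) for name, col in COLUMNS.items()}
for name, gain in gains.items():
    print(f"Gain({name}) = {gain:.3f}")
```

Running it prints one gain per attribute, which makes it easy to see which attribute the algorithm would pick at the root.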
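- Select the attribute with the highest information gain as the root node and repeat the process on each subset. From the counts in the table, salary has the largest gain (about 0.538 bits, against about 0.425 for age and 0.049 for department), so salary is the root. Every salary branch is already pure except "46K...50K" (40 senior, 23 junior). Within that branch, department separates the classes completely (sales and marketing are senior, systems is junior), and so does age, so one more split on either attribute finishes the tree.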
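The recursion can be sketched as follows. This is a minimal ID3-style illustration under stated assumptions: leaves are labeled by count-weighted majority, ties between attributes are broken by the order in `COLUMNS` (an arbitrary choice made for this sketch), and all function names are hypothetical.

```python
from collections import defaultdict
from math import log2

# Each row: (department, status, age, salary, count), copied from the question's table.
ROWS = [
    ("sales", "senior", "31...35", "46K...50K", 30),
    ("sales", "junior", "26...30", "26K...30K", 40),
    ("sales", "junior", "31...35", "31K...35K", 40),
    ("systems", "junior", "21...25", "46K...50K", 20),
    ("systems", "senior", "31...35", "66K...70K", 5),
    ("systems", "junior", "26...30", "46K...50K", 3),
    ("systems", "senior", "41...45", "66K...70K", 3),
    ("marketing", "senior", "36...40", "46K...50K", 10),
    ("marketing", "junior", "31...35", "41K...45K", 4),
    ("secretary", "senior", "46...50", "36K...40K", 4),
    ("secretary", "junior", "26...30", "26K...30K", 6),
]
COLUMNS = {"department": 0, "age": 2, "salary": 3}  # attribute name -> tuple index


def class_weights(rows):
    """Total count per class label."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[1]] += row[4]
    return totals


def entropy(rows):
    """Class-label entropy in bits, each row weighted by its count."""
    totals = class_weights(rows)
    n = sum(totals.values())
    return -sum((c / n) * log2(c / n) for c in totals.values())


def partition(rows, col):
    """Group rows by their value in column `col`."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[col]].append(row)
    return parts


def info_gain(rows, col):
    """Entropy before the split minus the count-weighted entropy after it."""
    n = sum(row[4] for row in rows)
    remainder = sum(
        sum(row[4] for row in part) / n * entropy(part)
        for part in partition(rows, col).values()
    )
    return entropy(rows) - remainder


def build_tree(rows, attrs):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    totals = class_weights(rows)
    majority = max(totals, key=totals.get)
    if len(totals) == 1 or not attrs:
        return majority
    # max() keeps the first attribute on ties, i.e. the order given in attrs.
    best = max(attrs, key=lambda name: info_gain(rows, COLUMNS[name]))
    if info_gain(rows, COLUMNS[best]) <= 0:
        return majority
    rest = [name for name in attrs if name != best]
    return {
        best: {
            value: build_tree(part, rest)
            for value, part in partition(rows, COLUMNS[best]).items()
        }
    }


tree = build_tree(ROWS, list(COLUMNS))
print(tree)
```

With this tie-breaking order the mixed salary branch is split on department. Listing age first in `COLUMNS` would produce an equally valid tree that splits that branch on age instead.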
(c) To classify a tuple with the values "systems," "26...30," and "46–50K" using a naïve Bayesian classifier, follow these steps:
- Compute the prior probability of each class:
  - P(senior) = 52/165
  - P(junior) = 113/165
- Compute the conditional probability of each attribute value given each class, using the counts:
  - P(department = systems | senior) = 8/52
  - P(department = systems | junior) = 23/113
  - P(age = 26...30 | senior) = 0/52
  - P(age = 26...30 | junior) = 49/113
  - P(salary = 46K...50K | senior) = 40/52
  - P(salary = 46K...50K | junior) = 23/113
- Compute the posterior probability of each class, up to a common factor, using Bayes' rule and the conditional independence assumption:
  - P(senior | department = systems, age = 26...30, salary = 46K...50K) ∝ P(senior) * P(department = systems | senior) * P(age = 26...30 | senior) * P(salary = 46K...50K | senior)
  - P(junior | department = systems, age = 26...30, salary = 46K...50K) ∝ P(junior) * P(department = systems | junior) * P(age = 26...30 | junior) * P(salary = 46K...50K | junior)
- Compare the posterior probabilities and assign the tuple to the class with the larger one.
Because P(age = 26...30 | senior) = 0, the product for "senior" is 0. The product for "junior" is (113/165)(23/113)(49/113)(23/113) ≈ 0.0123, which is greater than 0. The tuple is therefore classified as "junior".
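The calculation in (c) can be reproduced with a small script. The sketch below assumes plain relative-frequency estimates with no Laplace smoothing, which matches the hand calculation; `naive_bayes_scores` and `COLUMNS` are names invented for this example.

```python
from collections import defaultdict

# Each row: (department, status, age, salary, count), copied from the question's table.
ROWS = [
    ("sales", "senior", "31...35", "46K...50K", 30),
    ("sales", "junior", "26...30", "26K...30K", 40),
    ("sales", "junior", "31...35", "31K...35K", 40),
    ("systems", "junior", "21...25", "46K...50K", 20),
    ("systems", "senior", "31...35", "66K...70K", 5),
    ("systems", "junior", "26...30", "46K...50K", 3),
    ("systems", "senior", "41...45", "66K...70K", 3),
    ("marketing", "senior", "36...40", "46K...50K", 10),
    ("marketing", "junior", "31...35", "41K...45K", 4),
    ("secretary", "senior", "46...50", "36K...40K", 4),
    ("secretary", "junior", "26...30", "26K...30K", 6),
]
COLUMNS = {"department": 0, "age": 2, "salary": 3}  # attribute name -> tuple index


def naive_bayes_scores(rows, query):
    """Return P(class) * product of P(value | class) for each class, using counts."""
    class_totals = defaultdict(int)
    for row in rows:
        class_totals[row[1]] += row[4]
    n = sum(class_totals.values())

    scores = {}
    for label, total in class_totals.items():
        score = total / n  # prior P(class)
        for name, value in query.items():
            col = COLUMNS[name]
            # Count-weighted number of tuples of this class with the queried value.
            match = sum(r[4] for r in rows if r[1] == label and r[col] == value)
            score *= match / total  # no smoothing: a zero count zeroes the score
        scores[label] = score
    return scores


query = {"department": "systems", "age": "26...30", "salary": "46K...50K"}
scores = naive_bayes_scores(ROWS, query)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

A zero count for one attribute value wipes out the whole product for that class. A Laplace correction would avoid that, but the exercise as solved here uses the raw counts.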