Troubleshooting OEM Install Problems
Article by author Chris Foote
OEM Troubleshooting Overview
Like previous versions of Enterprise Manager,
the OEM 10G Grid Control is a multi-tier architecture consisting of
the HTML console, an OEM management service with an integrated
information repository and other OEM management agents running on
all monitored instances.
The OEM management service receives monitoring
data from the various agents and loads it into the management
repository. The OEM management console retrieves data from the
management repository, organizes it and then displays it as
information to the administrator via the HTML console interface.
The agents are programs that run continuously
on all servers that are controlled by the Enterprise Manager
architecture. Examples of the more popular targets found on the
servers are databases, application servers and listeners.
Agent Installation and Configuration Differences
In Oracle OEM, the agent software is installed
separately from the database and application server. During the
agent software installation, the installer will prompt the user for
the node name of the server that is running the 10G management
service. The target agent then contacts the management server and
uploads its configuration information, a complete reversal of the
way the process worked in previous releases.
OEM Agent Troubleshooting
But what if we lose communications between the
target agents and the management service? Although we are experts by
no means, we have done some troubleshooting from time to time.
If alerts and monitoring information aren't
being received by the management service, only a few
components can be causing the communication breakdown. The
problem could be on the management server itself. We'll take a look
at troubleshooting that environment in an upcoming blog. The problem
may be in the network connectivity, or lack thereof. If you don't
have network connectivity, you probably have more problems than just
being unable to administer and monitor the target with 10G EM. More
than likely you will also be getting calls from irate users letting
you know that they can't access the databases on that server. In
that case, do what DBAs have been doing for years - blame the
network administration boys.
The remainder of this blog will give you a few
helpful hints on troubleshooting the target agent software. Let's
take a look at our Agent Administration panel in 10G Grid Control.
The agent administration screen lists all of the agents currently
active in the 10G Grid Control environment. Each line displays the
agent software version, status (up, down, problem), number of
targets that are using it and the number of targets that aren't
using the agent. Each agent's name is a link that allows the user to
view more detailed configuration information about that agent. If I
click on that link, EM will display a panel that allows me to drill
down to more specific information about that agent. The Agent Drill
Down panel provides detailed information on the agent's
configuration, status, resource utilization, targets monitored and
upload information.
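The most important piece of information on the
agent administration panel is the column titled 'Last Successful
Load'. A timestamp there that stops advancing suggests the agent is
no longer uploading data to the management service.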
The next section of the agent administration
panel provides information on metric collection errors. Each line
contains information pertaining to the agent and when the collection
error occurred. If Oracle is able to provide additional information,
it will display the error message as an HTML navigation link that
allows the user to drill down into more specific information. I can
either roll my mouse over the navigation link or click on it to
display the more detailed error information.
Let's pick one of the errors and go through the
error determination process. We'll start with the problem that
occurred at the top of our Metric Errors Collection report. 10G Grid
Control was unable to run the Agent Process Statistics process on
Sept 3 at 6:46:27 AM.
The first step that needs to be executed when
performing agent error determination is to log on to the server
where the agent is running, navigate to the agent's home directory
and run a status command on the agent. The following command
displays pertinent information about the status and configuration of
the agent running on the target:
> emctl status agent
If you look at the output you will see that we
have loaded 606.39 MEGs of XML data to the management server and we
have 0 files and 0 MEGs of data pending upload. These numbers are
telling us that we have a functioning agent on this platform. If
the number of files pending upload keeps increasing, you have an
agent-to-management-server communication problem. If you want to see all of the
variations of the EMCTL command, type "emctl" at the prompt and do
not enter any other commands after it. Oracle will display a listing
of all the variations of the EMCTL command on the screen.
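If you would rather script the pending-upload check than eyeball it, here is a minimal sketch. It reads the output of the status command on standard input and prints the number of files pending upload. The label it matches ("Number of XML files pending upload") is an assumption about how your agent release words that line, and the function name pending_files is made up for illustration.

```shell
#!/bin/sh
# Sketch: pull the pending-upload file count out of `emctl status agent` output.
# ASSUMPTION: the output contains a line shaped like
#   Number of XML files pending upload : 0
# Adjust the pattern if your agent release words it differently.
pending_files() {
    # Read status text on stdin; print the value after the last colon
    # of the first matching line, with blanks stripped.
    awk -F: '/Number of XML files pending upload/ {
        gsub(/[ \t]/, "", $NF)
        print $NF
        exit
    }'
}

# Typical use on the agent host:
#   emctl status agent | pending_files
```

Run it a few minutes apart; a count that only ever grows points to the agent-to-management-server problem described above.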
This next screenshot displays the agent's
parent directory structures. Notice that I am navigating to the
SYSMAN subdirectory. SYSMAN is the parent directory for the
subdirectories that contain the diagnostic information that we will
need to use as input to solve our problem. We'll be spending the
bulk of our time in this directory structure. Although there are
numerous files and subdirectories, I will be covering just the files
that we have been using to debug our environment. If I don't, I'll
have another one of the world's largest blogs.
The CONFIG subdirectory contains files that
are used to tailor the agent to its host platform's configuration.
The emd.properties file is the main configuration file that we have
had to edit from time to time to fix a few of our issues. The
management service that the agent communicates with is identified in
the REPOSITORY_URL parameter. If you want to change the host name or
port of the management server that this agent communicates with, you
will have to edit the REPOSITORY_URL parameter to reflect the new
information. Later in this blog, I'll show you how to cleanse the
old information from the directories to allow the agent to
successfully connect to the new management server.
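As a rough sketch, the REPOSITORY_URL edit could be scripted like this. It assumes the parameter sits on a single line in the form REPOSITORY_URL=<url>. The function name set_repository_url and the host and port in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: point an agent at a different management service by rewriting
# REPOSITORY_URL in emd.properties. A backup is kept as <file>.bak.
# ASSUMPTION: the parameter is on one line as REPOSITORY_URL=<url>.
set_repository_url() {
    props_file=$1
    new_url=$2
    cp "$props_file" "$props_file.bak" || return 1
    # Use | as the sed delimiter because URLs contain slashes.
    # (A URL containing | or & would need escaping first.)
    sed "s|^REPOSITORY_URL=.*|REPOSITORY_URL=$new_url|" \
        "$props_file.bak" > "$props_file"
}

# Typical use (hypothetical host and port):
#   set_repository_url "$ORACLE_HOME/sysman/config/emd.properties" \
#       "http://newoms.example.com:4889/em/upload/"
```

Stop the agent before editing the file and compare the result against the .bak copy before restarting.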
There is one other parameter in emd.properties
that may cause a few problems. If your agent NEVER successfully
uploads data after installation, check the time zone parameter,
agentTZRegion, in the emd.properties file. It is usually the last
parameter in emd.properties. When our agents' time zones didn't match
the time zone in the management server's configuration file, we
weren't able to establish a successful connection between the agents
and the management service. So if you have agentTZRegion=America/New_York in
the agent's emd.properties file, you'll prevent a lot of headaches
if you use the same time zone in the management server's
emd.properties configuration file. We don't have any servers in
different time zones, but there is a wealth of information contained
in the 10G Grid Control documentation to help those that do. What I
do know is that when the time zones didn't match, we couldn't make a
connection.
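A quick way to compare the two settings is sketched below. It assumes you have copied the management server's emd.properties somewhere readable on the agent host. The function names and the /tmp path in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Sketch: compare agentTZRegion between two emd.properties files.
get_tz_region() {
    # Print the value of agentTZRegion from the given properties file
    # (the last one wins if the parameter appears more than once).
    sed -n 's/^agentTZRegion=//p' "$1" | tail -n 1
}

same_tz_region() {
    # Succeed only if both files define the same, non-empty agentTZRegion.
    tz_a=$(get_tz_region "$1")
    tz_b=$(get_tz_region "$2")
    [ -n "$tz_a" ] && [ "$tz_a" = "$tz_b" ]
}

# Typical use (the second path is a hypothetical copy of the management
# server's file):
#   same_tz_region "$ORACLE_HOME/sysman/config/emd.properties" \
#       /tmp/oms_emd.properties || echo "agentTZRegion mismatch"
```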
The EMD subdirectory contains a few files and
subdirectories that are important to us. The UPLOAD subdirectory is
a holding area for files that will be uploaded to the management
server. Lastupld.xml is pretty self-explanatory. It contains
information on agent uploads to the management server. The file that
you may be required to edit from time to time is the targets.xml
file. This file contains information about all of the targets
(databases, listeners, etc.) controlled by the agent on this
platform.
The DBSNMP user is the account that 10G Grid
Control uses to log on to the database to perform activities on
behalf of the 10G Enterprise Manager toolset. If you change the
password for DBSNMP, you will have to edit targets.xml to reflect
the new password. To do this, you change the value of ENCRYPTED to
FALSE and enter the new password as the VALUE on the
PASSWORD line for that database. I have also had to change the
release identifier after upgrading a database being monitored by 10G
Grid Control. If some targets are showing up on a platform, while
others are not, you may want to review the contents of targets.xml
to see if it has entries for the missing targets. Once again, you
can use the cleanse script I'll provide you with later in this blog
to update targets.xml and upload that information to the management
server.
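To see at a glance which targets the agent knows about, a sketch like the following lists them from targets.xml. It assumes each target is declared on its own line as a Target element carrying TYPE and NAME attributes; check your own file's layout first. The function name list_targets is made up for illustration.

```shell
#!/bin/sh
# Sketch: list the targets declared in an agent's targets.xml.
# ASSUMPTION: each target is on its own line, shaped like
#   <Target TYPE="oracle_database" NAME="orcl" ...>
list_targets() {
    # Print "<type> <name>" for every <Target ...> line in the file.
    grep '<Target ' "$1" | while IFS= read -r line; do
        t_type=$(printf '%s\n' "$line" | sed -n 's/.* TYPE="\([^"]*\)".*/\1/p')
        t_name=$(printf '%s\n' "$line" | sed -n 's/.* NAME="\([^"]*\)".*/\1/p')
        printf '%s %s\n' "$t_type" "$t_name"
    done
}

# Typical use on the agent host:
#   list_targets "$ORACLE_HOME/sysman/emd/targets.xml"
```

If a database or listener you expect is missing from the output, that is a good hint the entry never made it into targets.xml.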
The LOG subdirectory will probably be the area
where you spend most of your time when you are evaluating a problem
with the agent. This subdirectory contains several files that will
assist you in the problem determination process. The most important
debugging files are:
emagent.nohup - Agent watchdog log file
emagent.log - Main agent log file
emagent.trc - Main agent trace file
emagentfetchlet.log - Log file for Java
Fetchlets
emagentfetchlet.trc - Trace file for Java
Fetchlets
There is an excellent bulletin on Metalink that
will provide you with detailed information on the contents of these
logs. The bulletin number is 229624.1. Let's take a look at the
contents of emagent.trc to see if we can find the problem. I have
scrolled to the times that match the error that was displayed on the
top of our error metrics report. The file contains dozens of lines
before and after the time of the error stating that we are having an
out-of-memory condition. It looks like we have to perform some
further error-determination activities to solve the memory problem
before we can fix the agent. This out-of-memory condition is most
certainly affecting our database connections so I'll need to fix it
soon. Lucky for me it is our DBA "playpen" that we use for our own
testing (or I would have a bunch of irate users and/or developers
clamoring for my head).
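Rather than scrolling through each file by hand, you can count how often a message turns up across the agent logs listed earlier. The sketch below does that for one search pattern. The pattern in the usage comment ("out of memory") is only a guess at how the condition is worded in your trace file, and the function name scan_agent_logs is made up for illustration.

```shell
#!/bin/sh
# Sketch: count case-insensitive matches of a pattern in each agent log.
scan_agent_logs() {
    log_dir=$1
    pattern=$2
    for f in emagent.nohup emagent.log emagent.trc \
             emagentfetchlet.log emagentfetchlet.trc; do
        # Skip log files that do not exist on this host.
        [ -f "$log_dir/$f" ] || continue
        # grep -c exits non-zero on zero matches; that is not an error here.
        n=$(grep -i -c -- "$pattern" "$log_dir/$f" || true)
        printf '%s: %s\n' "$f" "$n"
    done
}

# Typical use on the agent host (the pattern is an assumption):
#   scan_agent_logs "$ORACLE_HOME/sysman/log" "out of memory"
```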
The Cleansing Script
If you go to Metalink and do a search on the
value 'sysman/emd/state/*' or document ID 303105.1, you'll see a
series of commands that can be used to "cleanse" the environment as
we like to describe it here at Giant Eagle. Whenever you get a
continuous 'status pending' message for a monitored target, change
the name of the management server, reinstall the agent software, or
remove the agent from the management server and re-add it (among a
host of other activities), Oracle recommends that you finish by
performing the following steps:
1. Stop the agent on the target node
   emctl stop agent
2. Delete any pending upload files from the agent home
   rm -r $ORACLE_HOME/sysman/emd/state/*
   rm -r $ORACLE_HOME/sysman/emd/collection/*
   rm -r $ORACLE_HOME/sysman/emd/upload/*
   rm $ORACLE_HOME/sysman/emd/lastupld.xml
   rm $ORACLE_HOME/sysman/emd/agntstmp.txt
   rm $ORACLE_HOME/sysman/emd/blackouts.xml
3. Issue an agent clearstate from the agent home
   emctl clearstate
4. Start the agent
   emctl start agent
5. Force an upload to the OMS
   emctl upload
If you change the name or port of the
management server, you will need to run these commands on all
platforms that are running the 10G agents. Because we have been
moving our 10G EM Management Service from one server to another
during our testing and implementation, we have created a script that
automates the above commands. We have also used the above series of
commands as a last resort when all other debugging avenues have
failed. The script just seems to fix a lot of agent-to-server
communication problems. Set your ORACLE_HOME to the Oracle agent's
home directory and run the script. If you execute it by typing
clean_up_oms.sh with no arguments, it will display the current
ORACLE_HOME, which must be set to the agent's home directory for
the script to run successfully.
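For readers who want to build something similar, here is a minimal sketch of a cleanse script assembled from the steps listed above. It is not our clean_up_oms.sh. The function names and the DRY_RUN switch are made up for illustration, and it assumes emctl lives in the bin directory of the agent home. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch of a cleanse script built from the Metalink steps.
# NOT the author's clean_up_oms.sh; run/cleanse_agent/DRY_RUN are made-up names.
run() {
    # Echo the command; execute it only when DRY_RUN is set to something
    # other than 1 (the default is 1, so nothing is removed by accident).
    echo "+ $*"
    [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

cleanse_agent() {
    if [ -z "${ORACLE_HOME:-}" ]; then
        echo "ORACLE_HOME is not set; point it at the agent home" >&2
        return 1
    fi
    emd="$ORACLE_HOME/sysman/emd"
    # ASSUMPTION: emctl is in the agent home's bin directory.
    emctl="$ORACLE_HOME/bin/emctl"
    run "$emctl" stop agent
    # rm -f/-rf tolerates files and directories that are already gone.
    run rm -rf "$emd"/state/* "$emd"/collection/* "$emd"/upload/*
    run rm -f "$emd/lastupld.xml" "$emd/agntstmp.txt" "$emd/blackouts.xml"
    run "$emctl" clearstate
    run "$emctl" start agent
    run "$emctl" upload
}

# Typical use, after setting ORACLE_HOME to the agent home:
#   cleanse_agent              # dry run: print the commands only
#   DRY_RUN=0 cleanse_agent    # actually run them
```

Defaulting to a dry run is a deliberate choice here: the rm commands are destructive, so it is worth reading the printed commands once before letting them loose on a real agent home.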